Running parallel OpenCL kernels - parallel-processing

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.
In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set which in some cases make it faster, or slower. My hope is that we can re-write the application for OpenCL and then execute each option/flag in parallel, rather than guessing which sets of flags to use.
My question is this:
How many kernels can a graphics card run in parallel? Is this something that can be looked at when purchasing? Is it linked to the number of shaders, memory, or the size of the application/kernel?
Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory?
Finally, would this even require multiple kernels, or could I use work-items instead? In that case, how do you determine how many work-items can run in parallel?
(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)

Your question seems to pop up from time to time in various forums and on SO. The feature you would use to run kernels separately on a hardware level is called device fission. Read more about the extension on this page, or google "cl_ext_device_fission".
This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission. You probably need a GPU from at least Q2 2014 or newer, but you will have to research that yourself.
The way to get kernels to run in parallel using OpenCL software alone is to queue them on different command queues on the same device. Some developers say that using multiple queues hurts performance, but I don't have personal experience with it.
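As a minimal sketch of the multi-queue approach (assuming a context, device and two built kernels named kernelA and kernelB already exist; those names and sizes are placeholders):
/* Two in-order queues on the same device; the driver is free to overlap
 * the two kernels, but is not required to. */
cl_int err;
cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

size_t gsz = 1024, lsz = 64;   /* illustrative sizes */
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsz, &lsz, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsz, &lsz, 0, NULL, NULL);

/* Wait for both queues; whether the kernels actually overlapped is up to
 * the driver and hardware. */
clFinish(q0);
clFinish(q1);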

How many kernels can a graphics card run in parallel?
You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of SIMDs/CUs/shaders/etc., depending on what the GPU vendor likes to call them. Getting an exact number of how many kernel instances really execute is a little complicated, because it depends on occupancy, which in turn depends on the resources the kernel uses, e.g. registers and local memory.
If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).
Can each kernel allocate "local" memory?
Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource, so it should be thought of as a kernel-controlled cache rather than a heap.
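As an illustration (not from the original question), a toy kernel that stages data through local memory might look like this, assuming it is launched with a work-group size equal to TILE:
#define TILE 256   /* must fit within the device's local memory budget */

__kernel void scale(__global const float *in, __global float *out, float factor)
{
    __local float tile[TILE];
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];              /* stage a chunk into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make it visible to the whole work-group */

    out[gid] = tile[lid] * factor;    /* work on the cached copy */
}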
In which case, how do you determine how many work-items can run in parallel?
Same answer as the first question, assuming you mean kernel instances.
Would this even require multiple kernels, could I use work-items instead?
You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on the work-item's global id. For example:
void subKernel0( .... )
{
    int gid = get_global_id(0);
    // etc.
}

void subKernel1( .... )
{
    int gid = get_global_id(0) - DISPATCH_SIZE_0;
    // etc.
}

__kernel void uberKernel( .... )
{
    if( get_global_id(0) < DISPATCH_SIZE_0 )
    {
        subKernel0( .... );
    }
    else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
    {
        subKernel1( .... );
    }
    else if( .... )
    {
        // etc.
    }
}
The usual performance suggestions about making the dispatch sizes multiples of 32/64, etc. also apply here. You'll have to adjust the various other ids as well.

For compatibility with hardware from roughly 2008 to 2015, it is safest to assume that every GPU can only run one kernel at any moment, and that kernels are swapped and compiled at runtime and queued up to emulate multiple kernels.
The swapping of kernels is why large kernels are better than tiny kernels.
Single-kernel compute units are the default.
Having the option to run 2 different, independent kernels in parallel at the same time is the exception. Assume it to be rare, unsupported, or slower.
Of course 2 CPUs in one computer can do that. But as of 2016, having 2 CPUs in one system is still a bit uncommon, and having 4 is even rarer.
Some graphics cards may be able to run 2 kernels in parallel. Assume that they cannot.

Related

Are OpenCL workgroups executed simultaneously?

My understanding was that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on it directly, no conflicts arise.
If the workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to only execute one workgroup at once or rearrange the execution order, so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
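A rough sketch of the single-work-group loop suggested above (the kernel name, chunk parameters and the per-element work are all just placeholders):
__kernel void serial_chunks(__global float *buf, int num_chunks, int chunk_size)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    for (int c = 0; c < num_chunks; ++c) {
        for (int i = lid; i < chunk_size; i += lsz) {
            buf[c * chunk_size + i] += 1.0f;   /* placeholder work on this chunk */
        }
        barrier(CLK_GLOBAL_MEM_FENCE);         /* finish one chunk before starting the next */
    }
}
Enqueue it with the global size equal to the local size so that only one work-group exists.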
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, breaking up the call to your kernel into many smaller calls. Make sure your command queue is not allowing out of order execution, so that the kernel calls are blocking.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
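A host-side sketch of that loop, assuming an in-order queue and kernel have already been set up (the sizes are illustrative, and the global offset requires OpenCL 1.1 or later):
size_t group_size = 64;               /* global size == local size: one group */
size_t num_groups = 16;               /* illustrative number of serialized groups */
for (size_t g = 0; g < num_groups; ++g) {
    size_t offset = g * group_size;   /* selects which chunk this launch handles */
    clEnqueueNDRangeKernel(queue, kernel, 1,
                           &offset,      /* global work offset */
                           &group_size,  /* global size = one work-group */
                           &group_size,  /* local size  = same */
                           0, NULL, NULL);
}
clFinish(queue);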
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.

Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to an FPGA. I managed to execute the vector addition example provided by Altera and it seems to work fine. I've looked at the documentation for Altera OpenCL and came to know that OpenCL uses pipelined parallelism to improve performance.
I was wondering if it is possible to achieve parallel execution similar to multiple processes in VHDL executing in parallel using Altera OpenCL in FPGA. Like launching multiple kernels in one device that can execute in parallel? Is it possible? How do I check if it is supported? Any help would be appreciated.
Thanks!
The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes; multiple work-items from the same work-group will run at the same time.
For 2/, use the num_compute_units kernel attribute; multiple work-groups will run at the same time.
Develop a single work-item kernel first, then use 1/ to improve the kernel performance; 2/ is generally considered last.
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items running at the same time on the FPGA device.
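For instance, the attributes might be combined like this on a simple vector-add kernel (the work-group size, SIMD width and compute-unit count below are purely illustrative):
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))   /* 1/ vectorise across work-items   */
__attribute__((num_compute_units(2)))     /* 2/ replicate the kernel pipeline */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}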
Note: depending on the nature of the problem you are solving, the above optimizations may not always be suitable.
If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel.
__attribute__((num_compute_units(N)))
__kernel void test(...) {
    ...
}
By doing this you essentially replicate the kernel N times. However, the Programming Guide states that you should probably first look into using the simd attribute, which performs the same operation over multiple data. That way, access to global memory becomes more efficient. If you increase the number of compute units and your kernels access global memory, there could be contention as multiple compute units compete for that access.
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for (short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This will essentially perform N element-wise additions of the vectors in one go by creating hardware to do the addition N times. If the data is dependent on the previous iteration, then it unrolls the pipeline.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by creating the kernels in one OpenCL file. When you compile the kernels, it will map and place-and-route all the kernels in the file onto the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.
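A host-side sketch of that, with two placeholder kernel names and each kernel on its own queue so they can overlap (program, context and device are assumed to already exist):
cl_int err;
cl_kernel producer = clCreateKernel(program, "producer", &err);
cl_kernel consumer = clCreateKernel(program, "consumer", &err);

cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

/* Single work-item kernels are typically launched with clEnqueueTask. */
clEnqueueTask(q0, producer, 0, NULL, NULL);
clEnqueueTask(q1, consumer, 0, NULL, NULL);

clFinish(q0);
clFinish(q1);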

How to ensure that my workitems are running in parallel?

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing from inside the kernel is not allowed, so how can I ensure that all my 8 cores are running in parallel?
Extra info (regarding my question): I am passing an input and an output array of size 8x8 to the kernel as buffers. Each work-item solves the row given by its work-item number and saves the result in the output buffer, and after that I read the result back.
If I run on the AMD platform SDK, I can add a print statement in the kernel with
#pragma OPENCL EXTENSION cl_amd_printf : enable
and then I can see clearly that, if I am using a 4-core machine, my first 4 work-items run in parallel and then the rest run in parallel, which shows it is solving a maximum of 4 in parallel.
But how can I see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't then this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
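For comparison, a launch shaped for the hardware might look something like this (sizes are illustrative only):
size_t global_item_size = 8192;   /* enough work-items to fill the GPU   */
size_t local_item_size  = 64;     /* a multiple of the 32-wide warp size */
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                       &global_item_size, &local_item_size, 0, NULL, NULL);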
If you are enqueueing enough work items (at least 32 but ideally thousands) then your "workitems are running parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.
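If a profiler is not available, the event profiling built into core OpenCL at least gives you kernel execution times. A minimal sketch, assuming context, device, kernel and the work sizes already exist:
cl_int err;
cl_command_queue q = clCreateCommandQueue(context, device,
                                          CL_QUEUE_PROFILING_ENABLE, &err);
cl_event evt;
size_t gsz = 8192, lsz = 64;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gsz, &lsz, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);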

Is it possible to use the same memory buffers for different kernels in OpenCL?

I am implementing a kernel function in which memory from the host side is transferred to the kernel. The kernel has three functions. Is it possible to share the same memory buffers with the kernels at different times?
Yes, multiple kernels can use the same memory objects, as long as there is no risk of the kernels being executed at the same time. That is the case with the usual setup of a single command queue not created with out-of-order execution.
Yes, I do this with my ray tracer. I have three kernels: a preprocessor which changes geometry, a ray tracer, and a post processor which does image processing. I share memory buffers with all three of them. I make sure each kernel finishes before I start the next one.
You can share memory without any problem. If the memory is read-only you can even use that memory object as an input for 2 kernels running concurrently (i.e. on different GPUs in the same context).
However, if you want to overwrite the memory zones, then be careful and use events to sync your kernels. I strongly recommend the events mechanism, since it enables parallel I/O reads and writes to the memory zones in another queue.
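A sketch of that events mechanism with two hypothetical kernels that share a buffer, the second waiting on the first (the queues, kernels and sizes are assumed to already exist):
cl_event write_done;
clEnqueueNDRangeKernel(q0, writerKernel, 1, NULL, &gsz, &lsz,
                       0, NULL, &write_done);

/* The reader does not start until the writer has finished with the buffer. */
clEnqueueNDRangeKernel(q1, readerKernel, 1, NULL, &gsz, &lsz,
                       1, &write_done, NULL);
clFinish(q1);
clReleaseEvent(write_done);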

How are light weight threads scheduled by the linux kernel on a multichip multicore SMP system?

I am running a parallel algorithm using light threads and I am wondering how these are assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on the chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but in Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to a core with lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(). See the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the processors allowed to programs or you could modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files to provide functionality similar to the taskset(1) program.)
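As a minimal sketch of the pthread_setaffinity_np(3) route, pinning the calling thread to one core (the helper function name is just for illustration):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to the given core; returns 0 on success. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    return rc;
}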

Resources