OpenCL: Can one local group be executed in multiple compute units?

I am running an experiment with OpenCL on a CPU. There is a loop inside my kernel where all work-items in a local group are synchronized in the middle of each iteration and at the end of each iteration. The reason I am doing this is that the overhead of creating cl_mem objects and enqueueing the kernel on every iteration seems greater than the benefit of parallelization.
I passed the kernel a local group size equal to the global work size, for synchronization purposes. It appears the kernel is executed on one CPU core instead of on all CPU cores.
Can one local group be executed on multiple compute units? If not, is there any way to keep synchronization between multiple compute units?

You can't synchronize different work-groups during execution.
A compute unit is usually a group of processing elements that is able to communicate in order to synchronize its tasks.
For example, an NVIDIA compute unit (streaming multiprocessor) has 8 streaming processors, and it is only possible to synchronize the tasks running within that particular compute unit.
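Within a single work-group, though, you can synchronize at any point in a loop with barrier(). A minimal kernel sketch of the pattern from the question (the buffer names and the neighbour access are made up for illustration):
__kernel void iterate(__global float *data, __local float *scratch, const int iters)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);

    for (int it = 0; it < iters; ++it)
    {
        scratch[lid] = data[gid];                            // per-item work
        barrier(CLK_LOCAL_MEM_FENCE);                        // mid-iteration sync: scratch is now visible to the whole work-group
        data[gid] = scratch[lid] + scratch[(lid + 1) % lsz]; // use a neighbour's value
        barrier(CLK_LOCAL_MEM_FENCE);                        // end-of-iteration sync before scratch is overwritten
    }
}
This only works within one work-group; there is no barrier across work-groups.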
I would first try to find out whether you really have the overhead you mentioned:
How the synchronization works, and how efficient it is compared to simply enqueueing several kernels, depends on the system you are using. A CPU is quite good at working through a lot of different kernels, since you don't have to transfer them to GPU memory or similar.
I would suggest you benchmark it; OpenCL provides powerful profiling functionality!
You have to initialise your queue with CL_QUEUE_PROFILING_ENABLE; afterwards you can print profiling information for each event object, for example like this (I am using the C++ bindings):
std::vector<cl::Event> evts;
cl_ulong param;
// [Create some event objects by enqueueing and executing kernels]
for (unsigned int i = 0; i < evts.size(); i++) {
    evts[i].getProfilingInfo(CL_PROFILING_COMMAND_QUEUED, &param);
    printf("%u: %llu", i, (unsigned long long)param);
    evts[i].getProfilingInfo(CL_PROFILING_COMMAND_SUBMIT, &param);
    printf(" %llu", (unsigned long long)param);
    evts[i].getProfilingInfo(CL_PROFILING_COMMAND_START, &param);
    printf(" %llu", (unsigned long long)param);
    evts[i].getProfilingInfo(CL_PROFILING_COMMAND_END, &param);
    printf(" %llu\n", (unsigned long long)param);
}
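For reference, creating the profiling-enabled queue and filling the event vector might look like this (a sketch using the same C++ bindings; context, device, kernel, globalSize and localSize are assumed to exist already):
// Assumes an existing cl::Context `context`, cl::Device `device` and cl::Kernel `kernel`.
cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);

cl::Event evt;
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(globalSize), cl::NDRange(localSize),
                           NULL, &evt);
evts.push_back(evt);   // timing data becomes available once the kernel has completed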

Related

Optimal number of parallel processes for computation with a CPU with 6 cores and 12 threads

On a computer with an Intel CPU marketed as "6 cores / 12 threads", I want to run as many processes as possible, each of them doing similar math computations (each process has a single thread) with different input data. There is no GPU involved, and no inter-process communication is needed.
What is the optimal number of parallel processes of the same executable doing math computations?
Should I run 6 processes (one per physical core)? Or 12 processes (one per thread / virtual core)?
If one process does, say, 1000 computations per second, I'm pretty sure that running 6 of them will run at ~1000/sec each (so a total of ~6000/sec).
But won't running 12 processes make them only 500 computations per second each?
TL;DR: should I run one process per "core" or one process per "thread" on a "6 cores/12 threads Intel CPU"?
It depends very much on the actual computing code. Some applications can benefit from hyper-threading while some cannot. High-performance applications rarely benefit from hyper-threading, so using 1 process per core is most likely the best configuration, assuming the code is compute-bound and scales well.
The two hyper-threads of a recent Intel core (e.g. Skylake/Ice Lake) share some execution ports. As a result, the overall execution can be faster if one process alone is not able to saturate those ports. In practice, this is a bit more complex (modern processors are very complex), since compute-bound processes can also be limited by other parts of the processor, such as instruction decoding or more obscure low-level units.
For example, the following C code should benefit from hyper-threading (assuming no fast-math optimizations are applied and the code is compiled with optimizations):
float sum = 0.f;
for (int i = 0; i < maxi; ++i)
    sum += array[i];   // every addition depends on the previous one
Indeed, the latency of a floating-point addition instruction is 3 to 4 cycles, while generally 2 of them can be executed per cycle (only 1 before Skylake). This means the code is bound by the latency of the chain of addition instructions. Hyper-threads can use the otherwise idle execution port during this time, resulting in up to twice faster execution (other bottlenecks mean the speed-up is not that large in practice). If the code is built with fast-math optimizations, then compilers can unroll the loop and make use of instruction-level parallelism (ILP). A low IPC (instructions per cycle) often means that using hyper-threading may be beneficial, especially if the cause of the low IPC is latency (e.g. instruction latency and cache misses). Unfortunately, this is not always true. For example, the following code should not be faster with hyper-threading:
for (int i = 0; i < maxi; ++i)
    out_array[i] += in_array[i];   // one store per iteration
This is because there is generally 1 store execution port on Intel processors, and it should already be saturated by 1 hyper-thread (otherwise the loop is memory-throughput bound, which is no better for hyper-threading). Thus, using more hyper-threads should not improve the execution time. In fact, hyper-threading introduces a slight overhead that should cause a slightly slower execution.
The thing is, applications are generally much more complex than that, and one does not know how the math functions are implemented. As a result, it is nearly impossible for a developer to know which configuration is best without a basic benchmark, unless the computing kernel is simple.
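To illustrate the latency point on the first loop: a hand-unrolled variant with independent accumulators, a sketch of what a compiler may generate under fast-math (array and maxi are assumed to be defined elsewhere), lets a single thread keep the addition ports much busier:
// Four independent dependency chains; additions from different chains can
// overlap in the pipeline, so one thread gets closer to saturating the FP ports.
float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
int i = 0;
for (; i + 3 < maxi; i += 4) {
    s0 += array[i + 0];
    s1 += array[i + 1];
    s2 += array[i + 2];
    s3 += array[i + 3];
}
for (; i < maxi; ++i)   // remainder
    s0 += array[i];
float sum = (s0 + s1) + (s2 + s3);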

__threadfence implies the effect of __syncthreads?

I'm implementing parallel reduction in CUDA.
The kernel has a __syncthreads() to wait for all threads to complete two reads from shared memory, after which each thread writes the sum back to shared memory.
Should I use a __threadfence_block() to ensure that writes to shared memory are visible to all threads for the next iteration, or use __syncthreads() as given in NVIDIA's example?
__syncthreads() implies a memory fence function as well. This is covered in the documentation:
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
So in this case it would not be necessary to use __threadfence_block() in addition to __syncthreads().
You cannot substitute a threadfence function for the execution barrier in the usual general parallel reduction. The execution barrier (__syncthreads()) is required in addition to the memory fencing function. In the general case, it's generally necessary to wait for all threads to execute a given round of reduction before proceeding with the next round; __threadfence_block() by itself will not force warps to wait while other warps are executing a given round of reduction.
Therefore __syncthreads() is generally required, and assuming you have used it properly, the __threadfence_block() is generally not required.
__syncthreads() implies __threadfence_block().
__threadfence_block() does not imply __syncthreads()
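As a concrete illustration, here is a minimal shared-memory reduction sketch (the block size and kernel name are made up; it mirrors the structure of NVIDIA's example rather than reproducing it) showing where the barrier is needed:
#define BLOCK_SIZE 256   // launch with BLOCK_SIZE threads per block

__global__ void reduceSum(const float *in, float *out, unsigned int n)
{
    __shared__ float sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.f;
    __syncthreads();                      // all loads into shared memory are done and visible

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s]; // two shared-memory reads, one write
        __syncthreads();                  // execution barrier + implied fence before the next round
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}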

Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to an FPGA. I managed to execute the vector addition example provided by Altera and it seems to work fine. I've looked at the documentation for Altera OpenCL and learned that OpenCL uses pipelined parallelism to improve performance.
I was wondering whether it is possible to achieve parallel execution similar to multiple VHDL processes executing in parallel, using Altera OpenCL on an FPGA. Like launching multiple kernels on one device that execute in parallel? Is it possible? How do I check whether it is supported? Any help would be appreciated.
Thanks!
The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes; multiple work-items from the same work-group will then run at the same time.
For 2/, use the num_compute_units kernel attribute; multiple work-groups will then run at the same time.
Please develop a single work-item kernel first, then use 1/ to improve the kernel performance; 2/ should generally be considered last (a combined sketch follows at the end of this answer).
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items, running at the same time on the FPGA device.
Note: depending on the nature of the problem you are solving, the above optimizations may not always be suitable.
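A sketch of a kernel carrying both attributes (the kernel body, work-group size and replication factor are only illustrative):
__attribute__((reqd_work_group_size(64, 1, 1)))   // required so the compiler can apply SIMD
__attribute__((num_simd_work_items(4)))           // 1/: 4 work-items issued per cycle
__attribute__((num_compute_units(2)))             // 2/: 2 copies of the pipeline on the FPGA
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
Note that the work-group size (64) must be divisible by the SIMD factor (4).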
If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel:
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the Programming Guide states that you should probably first look into using the SIMD attribute, which performs the same operation over multiple data. This way, access to global memory becomes more efficient. If you increase the number of compute units and your kernels access global memory, there could be contention as multiple compute units compete for access to global memory.
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for (short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This will essentially perform the element-wise summing of the vectors N elements at a time, by creating hardware that does the addition N times in one go. If an iteration depends on data from the previous one, the unrolled iterations cannot all run in parallel.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by creating the kernels in one OpenCL file. When you compile it, the kernels in the file are placed and routed onto the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.
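A minimal host-side sketch of that last point, assuming the compiled program contains two kernels (here called kernelA and kernelB, with their arguments already set) and that you have created one in-order command queue per kernel (queue_a and queue_b) so the launches are not serialized behind each other:
/* Two single-work-item kernels enqueued on separate queues; on the FPGA they
   occupy separate hardware and can therefore run concurrently. */
cl_int err;
err  = clEnqueueTask(queue_a, kernelA, 0, NULL, NULL);
err |= clEnqueueTask(queue_b, kernelB, 0, NULL, NULL);
clFinish(queue_a);   /* wait for both to complete */
clFinish(queue_b);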

Running parallel OpenCL kernels

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.
In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set which in some cases make it faster, or slower. My hope is that we can re-write the application for OpenCL and then execute each option/flag in parallel, rather than guessing which sets of flags to use.
My question is this:
How many kernels can a graphics card run in parallel? Is this something that can be looked at when purchasing? Is it linked to the number of shaders, the memory, or the size of the application/kernel?
Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory?
Finally, would this even require multiple kernels, could I use work-items instead? In which case, how do you determine how many work-items can run in parallel?
(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)
Your question seems to pop up from time-to-time in various forums and on SO. The feature you would use to run kernels separately on a hardware level is called device fission. Read more about the extension on this page, or google "cl_ext_device_fission".
This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission; you probably need a GPU from at least Q2 2014 or newer, but you will have to research that yourself.
The way to get kernels to run in parallel using OpenCL software alone is to queue them on different command queues on the same device. Some developers say that multiple queues hurt performance, but I don't have personal experience with it.
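For illustration, a sketch of the multi-queue approach using the C API (the context, device and the two kernel objects are assumed to have been created already; the global size is arbitrary):
/* One queue per concurrent kernel, both on the same device. */
cl_int err;
cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

size_t global = 1024;
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &global, NULL, 0, NULL, NULL);

/* Whether the two dispatches actually overlap is up to the driver and hardware. */
clFinish(q0);
clFinish(q1);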
How many kernels can a graphics card run in parallel?
You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of the SIMDs/CUs/shaders/etc., depending on what the GPU vendor likes to call them. It gets a little complicated to get an exact number of how many kernel instances really execute, as this depends on occupancy, which in turn depends on the resources the kernel uses, e.g. registers and local memory.
If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).
Can each kernel allocate "local" memory?
Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource, so it should be thought of as a kernel-controlled cache rather than a heap.
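A tiny sketch of both ways of getting local memory, statically sized inside the kernel or sized from the host (the names and the work-group size of 64 are illustrative):
__kernel void useLocal(__global const float *in,
                       __global float *out,
                       __local float *dynScratch)   // sized by the host: clSetKernelArg(k, 2, bytes, NULL)
{
    __local float tile[64];                         // statically sized local allocation
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];                            // assumes a work-group size of 64
    dynScratch[lid] = tile[lid] * 2.0f;
    barrier(CLK_LOCAL_MEM_FENCE);                   // make the local writes visible to the work-group

    out[gid] = dynScratch[lid];
}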
In which case, how do you determine how many work-items can run in parallel?
Same answer as for the first question, assuming you mean kernel instances.
Would this even require multiple kernels, could I use work-items instead?
You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on the work-item's global id. For example:
void subKernel0( .... )
{
    int gid = get_global_id(0);
    // etc.
}

void subKernel1( .... )
{
    int gid = get_global_id(0) - DISPATCH_SIZE_0;
    // etc.
}

__kernel void uberKernel( .... )
{
    if( get_global_id(0) < DISPATCH_SIZE_0 )
    {
        subKernel0( .... );
    }
    else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
    {
        subKernel1( .... );
    }
    else if( .... )
    {
        // etc.
    }
}
The usual performance suggestion of making the dispatch sizes multiples of 32/64, etc. applies here as well. You'll also have to adjust the various other ids accordingly.
For compatibility with roughly 2008 to 2015 hardware, it is safe to just assume that every GPU can only run one kernel at any moment, and that kernels are swapped and compiled at runtime and queued up to emulate multiple kernels.
Swapping of kernels is why large kernels are better than tiny kernels.
Single-kernel compute units are the default.
Having the option to run 2 different, independent kernels in parallel at the same time is the exception. Assume it to be rare, and either unsupported or slower.
Of course, 2 CPUs in one computer can do that. But as of 2016, having 2 CPUs in one system is still a bit uncommon, and 4 is even rarer.
Some graphics cards may be able to run 2 kernels in parallel. Assume they cannot do such a thing.

How to ensure that my work-items are running in parallel?

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing from inside the kernel is not allowed. So how can I check that all my 8 cores are running in parallel?
Extra info (regarding my question): for the kernel, I am passing input and output arrays of 8x8 size as buffers. Each work-item solves the row corresponding to its id and saves the result in the output buffer, and after that I read the result back.
If I am running on the AMD platform SDK, where I can add a print statement to the kernel with
#pragma OPENCL EXTENSION cl_amd_printf : enable
I can clearly see that, on a 4-core machine, my first 4 work-items run in parallel and then the rest run in parallel, which shows that at most 4 are solved in parallel.
But how can I see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining whether your code is actually executing in parallel. You could have 4 threads running concurrently on a single core, for example, and your printf statements would still appear in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that tells you how well your device is being utilised, which is different from determining whether certain work-items are actually executing in parallel. The best way to do this is to use a profiler, which usually provides such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't, this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be a multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
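For example, a launch along those lines, using the same variable names as your snippet (the sizes are only illustrative; the local size must divide the global size and match your data layout):
/* 1024 work-items in groups of 64 (a multiple of the 32-wide warp),
   instead of 8 work-items in groups of 1. */
size_t global_item_size = 1024;
size_t local_item_size  = 64;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                       &global_item_size, &local_item_size,
                       0, NULL, NULL);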
If you are enqueueing enough work-items (at least 32, but ideally thousands), then your work-items are running in parallel.
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.
