Altera OpenCL parallel execution in FPGA - parallel-processing

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to FPGA. I managed to execute the vector addition example provided by Altera and seems to work fine. I've looked at the documentations for Altera OpenCL and came to know that OpenCL uses pipelined parallelism to improve performance.
I was wondering if it is possible to achieve parallel execution similar to multiple processes in VHDL executing in parallel using Altera OpenCL in FPGA. Like launching multiple kernels in one device that can execute in parallel? Is it possible? How do I check if it is supported? Any help would be appreciated.
Thanks!

The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use num_simd_work_items and reqd_work_group_size kernel attributes, multiple work-items from the same work-group will run at the same time
For 2/, use num_compute_units kernel attribute, multiple work-groups will run at the same time
Please develop single work-item kernel first, then use 1/ to improve the kernel performance, 2/ will generally be considered at last.
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items running at the same time on the FPGA device.
Note: Depending on the nature of the problem you are solving, may the above optimization may not always suitable.

If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is a attribute that you can add before the kernel.
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the Programming guide states that you probably first look into using the simd attribute where it performs the same operation but over multiple data. This way, the access to global memory becomes more efficient. By increasing the number of compute units, if your kernels have global memory access, there could be contention as multiple compute units are competing for access to global memory.
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for(short i = 0; i < N; i++)
sum[i] = a[i] + b[i]
This will essentially perform the summing of a vector by element N times in one go by creating hardware to do the addition N times. If the data is dependent on the previous iteration, then it unrolls the pipeline.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by creating your kernels in an OpenCL file. When you compile the kernels, it will map and par the kernels in the file into the FPGA together. Afterwards, you just need to envoke the kernel in your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.

Related

Emulate a very fast (virtual) CPU core

I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job in for example 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with multiprocessing module: on a 4-core CPU, it would allow to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to make faster a non-easily-splittable computation job.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 Ghz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 Ghz (4*3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEnginnering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal as it can hardly match true manual parallelization done by a person how understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/

CPU SIMD vs GPU SIMD?

GPU uses the SIMD paradigm, that is, the same portion of code will be executed in parallel, and applied to various elements of a data set.
However, CPU also uses SIMD, and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions will process data elements with parallelism.
While the SIMD paradigm seems to be used differently in GPU and CPU, does GPUs have more SIMD power than CPUs?
In which way the parallel computational capabilities in a CPU are 'weaker' than the ones in a GPU?
Both CPUs & GPUs provide SIMD with the most standard conceptual unit being 16 bytes/128 bits; for example a Vector of 4 floats (x,y,z,w).
Simplifying:
CPUs then parallelize more through pipelining future instructions so they proceed faster through a program. Then next step is multiple cores which run independent programs.
GPUs on the other hand parallelize by continuing the SIMD approach and executing the same program multiple times; both by pure SIMD where a set of programs execute in lock step (which is why branching is bad on a GPU, as both sides of an if statement must execute; and one result be thrown away so that the lock step programs proceed at the same rate); and also by single program, multiple data (SPMD) where groups of the sets of identical programs proceed in parallel but not necessarily in lock step.
The GPU approach is great where the exact same processing needs be applied to large volumes of data; for example a million vertices than need to be transformed in the same way, or many million pixels that need the processing to produce their colour. Assuming they don't become data block/pipeline stalled, GPUs programs general offer more predictable time bound execution due to its restrictions; which again is good for temporal parallelism e.g. the programs need to repeat their cycle at a certain rate for example 60 times a second (16ms) for 60 fps.
The CPU approach however is better for decisioning and performing multiple different tasks at the same time and dealing with changing inputs and requests.
Apart from its many other uses and purposes, the CPU is used to orchestrate work for the GPU to perform.
It's a similar idea, it goes kind of like this (very informally speaking):
The CPU has a set amount of functions that can run on packed values. Depending on your brand and version of your CPU, you might have access to SSE2, 3, 4, 3dnow, etc, and each of them gives you access to more and more functions. You're limited by the register size and the larger data types you work with the less values you can use in parallel. You can freely mix and match SIMD instructions with traditional x86/x64 instructions.
The GPU lets you write your entire pipeline for each pixel of a texture. The texture size doesn't depend on your pipeline length, ie the number of values you can affect in one cycle isn't dependant on anything but your GPU, and the functions you can chain (your pixel shader) can be pretty much anything. It's somewhat more rigid though in that the setup and readback of your values is somewhat slower, and it's a one shot process (load values, run shader, read values), you can't massage them at all besides that, so you actually need to use a lot of values for it to be worth it.

Running parallel OpenCL kernels

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.
In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set which in some cases make it faster, or slower. My hope is that we can re-write the application for OpenCL and then execute each option/flag in parallel, rather than guessing which sets of flags to use.
My question is this:
How many kernels can a graphics card run in parallel. Is this something that can be looked at when purchasing? Is it linked to the number of shaders, memory, or the size of the application/kernel?
Additionally, while the input to the application will be the same each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory.
Finally, would this even require multiple kernels, could I use work-items instead? In which case, how do you determine how many work-items can run in parallel?
(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)
Your question seems to pop up from time-to-time in various forums and on SO. The feature you would use to run kernels separately on a hardware level is called device fission. Read more about the extension on this page, or google "cl_ext_device_fission".
This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission. You probably need a GPU from at least Q2 2014 or newer, but this will have to be up to you to research.
The way to get kernels to run in parallel using OpenCL software only is to queue them with different command queues on the same device. Some developers say that multiple queues harms performance, but I don't have experience with it personally.
How many kernels can a graphics card run in parallel?
You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of SIMDs/CUs/shaders/etc. depending on what the GPU vendor likes to call them. It gets a little complicated to get an exact number of how many kernel instances really execute as this depends on the occupancy which depends on the resources the kernel uses, e.g. registers used, local memory used.
If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).
Can each kernel allocate "local" memory?
Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource so should be thought of a kernel controlled cache rather than a heap.
In which case, how do you determine how many work-items can run in parallel?
Same answer as the first question assuming kernel instances.
Would this even require multiple kernels, could I use work-items instead?
You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on work item global id. For example:
void subKernel0( .... )
{
int gid = get_global_id(0);
// etc.
}
void subKernel1( .... )
{
int gid = get_global_id(0) - DISPATCH_SIZE_0;
// etc.
}
__kernel uberKernel( .... )
{
if( get_global_id(0) < DISPATCH_SIZE_0 )
{
subKernel0( .... );
}
else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
{
subKernel1( .... );
}
else if( .... )
{
// etc.
}
}
The usual performance suggestions for making the dispatch size multiples of 32/64, etc. also apply here. You'll also have to adjust the various other ids as well.
In favor of compatibility to 2008ish to 2015ish hardware, just assume safely that every gpu can only run one Kernel at any Moment and that Kernels are swapped and compiled on runtume , queued to emulate multiple Kernels.
Swapping of Kernels is why large Kernels are better than tiny Kernels.
Single-Kernel Client computeunits are the default.
Having The option to run 2 parallel different independent Kernels at the same time is the exception. Assume it to ne rare and unsupported or slower.
Of course 2cpus in one Computer can so that. But as of 2016 having 2 cpus in one system is still a bit too uncommon. Even rarer to have 4.
Some graphiccards may ne able to run 2 cernels in parallel. Assumme them to not so such a thing.

OpenCL: work group concept

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which ones get executed in parallel.
However, why is there this need of coarser subdivision ? Wouldn't it be OK to have only the grid of threads (and, de facto, only one W-G)?
Should a Work-Group exactly map to a physical core ? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this??
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?
Part of the confusion here I think comes down to terminology. What GPU people often call cores, aren't really, and what GPU people often call threads are only in a certain sense.
Cores
A core, in GPU marketing terms may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison, you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already alot of good answers, for further understanding of the terminology of OpenCL this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) actually describes all the concepts very well.
Use of the work-groups allows more optimization for the kernel compilers. This is because data is not transferred between work-groups. Depending on used OpenCL device, there might be caches that can be used for local variables to result faster data accesses. If there is only one work-group, local variables would be just the same as global variables which would lead to slower data accesses.
Also, usually OpenCL devices use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think that, only way to find the fastest work-group size, is to try different work-group sizes. It is also possible to query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be multiple of that.
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. This way there is no data dependencies between them and they can also be run sequentially, if that is considered to be the fastest way to run them. To achieve same result, than synchronization between work-groups, kernel needs to split into multiple kernels. Variables can be transferred between the kernels with buffers.
One benefit of work groups is they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work item in the work group. A good example is the game of life: each cell depends on itself and the 8 around it. If each work item read this information you'd have 9x global memory reads. By using work groups and shared local memory you can approach 1x global memory reads (only approach since there is redundant reads at the edges).

How to ensure that my workitems are running parallel?

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing in the kernel is not allowed. Hence, how to ensure that all my 8 cores are running in parallel?
Extra info (regarding my question): for kernel, i am passing input and and output array of 8X8 size as a buffer. According to workitem number, i am solving that row and saving the result in output buffer. and after that i am reading the result.
If i am running AMD platform SDK, where i add print statement in kernel by
#pragma OPENCL EXTENSION cl_amd_printf : enable
hence i can see clearly, if i am using 4 core machine, my first 4 cores are running parallel and then rest will run in parallel, which shows it is solving maximum 4 in parallel.
But, how can i see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't then this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
If you are enqueueing enough work items (at least 32 but ideally thousands) then your "workitems are running parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.

Resources