How to ensure that my work-items are running in parallel? - parallel-processing

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Here, printing in the kernel is not allowed. So how can I ensure that all my 8 cores are running in parallel?
Extra info (regarding my question): for the kernel, I am passing input and output arrays of 8x8 size as buffers. Based on the work-item number, I solve that row and save the result in the output buffer, and after that I read the result back.
If I run on the AMD platform SDK, I can add a print statement in the kernel with
#pragma OPENCL EXTENSION cl_amd_printf : enable
so I can see clearly that, on a 4-core machine, my first 4 work-items run in parallel and then the rest run in parallel, which shows it is solving at most 4 at a time.
But how can I see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.

Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't then this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be a multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
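For illustration, a minimal host-side sketch of a launch configuration that gives the GPU whole warps to fill (the sizes 1024 and 64 are example values, not taken from the question):
/* Example sizes only: the work-group is a multiple of the 32-wide warp
 * and the NDRange contains hundreds of work-items. */
size_t global_item_size = 1024;  /* total work-items, a multiple of the local size */
size_t local_item_size  = 64;    /* work-group size: two warps of 32 */
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                       &global_item_size, &local_item_size,
                       0, NULL, NULL);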

If you are enqueueing enough work-items (at least 32 but ideally thousands) then your "work-items are running in parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.
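If a profiler isn't available, a rough alternative is to time the kernel with OpenCL events and compare runtimes as the global size grows; near-constant time for larger sizes is a hint that work-items are executing concurrently. A minimal sketch, assuming command_queue was created with CL_QUEUE_PROFILING_ENABLE:
cl_event evt;
cl_ulong t_start = 0, t_end = 0;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                       &global_item_size, &local_item_size,
                       0, NULL, &evt);
clWaitForEvents(1, &evt);
/* Timestamps are reported in nanoseconds. */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
printf("kernel time: %f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);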

Related

GPU programming model - how many simultaneous, divergent threads without penalty

I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPUs, threads get bundled into warps of 32 threads. All threads in a warp execute the same instructions, so if there is branch divergence, all threads essentially pay the time cost of every branch taken within the warp. However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost, since the different warps are executed by separate computational resources. So my question is, how many concurrent warps can be executed this way, where divergence doesn't cause this penalty... i.e. what number is it that I should look for in the spec sheet? Is it the number of "shader processors" or the number of "Streaming Multiprocessors" that is relevant here?
Also, the same question for AMD Radeon: Here the relevant terms might be "unified shaders" and "compute units".
Finally, suppose I have a workload that is highly divergent across threads so I essentially just want one thread per warp. Essentially using the GPU as an ordinary multi-core CPU. Is that possible and how should I lay out the threads and thread-blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp. I realize this might not be the ideal workload for GPGPU but it would be usable for running an activity in the background without blocking the host CPU.
I am new to GPGPU and am instead learning OpenCL. But this question has remained unanswered for months, so I'll have a stab at it (and hopefully an expert will correct me if I'm wrong).
However, it seems that different warps executing simultaneously on the GPU can have divergent branches without this cost since the different warps are executed by separate computational resources
Not necessarily. On AMD systems, only 64 work-items (called threads in CUDA) are worked on at any given time (technically, each VALU in AMD hardware works on 16 items at once, but every instruction is repeated four times, so 64 items per AMD "wavefront"). On NVIDIA systems, 32 threads are executed at a time per warp.
Of course, the "block size" is likely far larger than 64. So if you were doing 32x32 pixel blocks, you'd need 1024 cores / shaders / work-items per work-group (OpenCL) or thread block (CUDA).
These 1024 threads CAN diverge without penalty under NVidia Pascal, because they're split into sets of 32.
So if you have a work-group / thread block size of 1024, corresponding to a 32x32 block of pixels, the first two rows (64 work-items) will execute together on one VALU (AMD GCN), or the first row (32 threads) on one warp of an SM (NVIDIA Pascal). As long as ALL of those 32 threads / 64 work-items take the same branches, you won't have any penalties.
Finally, suppose I have a workload that is highly divergent across threads so I essentially just want one thread per warp. Essentially using the GPU as an ordinary multi-core CPU. Is that possible and how should I lay out the threads and thread-blocks for this to happen? Can I avoid allocating memory etc. for the 31 redundant threads in the warp. I realize this might not be the ideal workload for GPGPU but it would be usable for running an activity in the background without blocking the host CPU.
if (threadid > 0) {
    /* all other threads skip the work and sit idle */
} else {
    dostuff();  /* only thread 0 takes this branch */
}
Honestly, I think it's best if you just let the code diverge and hope for the best. All of those cores have resources of their own (registers and so on).
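If you do want roughly one active thread per warp, here is a hedged OpenCL C sketch of the idea (the 32-wide warp size is an NVIDIA-specific assumption, AMD wavefronts are 64 wide, and the kernel name and buffer are made up for illustration):
__kernel void one_lane_per_warp(__global int *data)
{
    /* Assumption: 32 work-items per warp (NVIDIA); use 64 for AMD wavefronts. */
    const size_t lane = get_local_id(0) % 32;
    const size_t warp = get_global_id(0) / 32;
    if (lane == 0) {
        /* Only lane 0 of each warp does real work; the other 31 lanes idle,
         * but they still occupy registers and scheduling slots. */
        data[warp] += 1;
    }
}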

Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to an FPGA. I managed to execute the vector addition example provided by Altera and it seems to work fine. I've looked at the documentation for Altera OpenCL and came to know that OpenCL uses pipelined parallelism to improve performance.
I was wondering if it is possible to achieve parallel execution similar to multiple processes in VHDL executing in parallel using Altera OpenCL in FPGA. Like launching multiple kernels in one device that can execute in parallel? Is it possible? How do I check if it is supported? Any help would be appreciated.
Thanks!
The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes, so that multiple work-items from the same work-group run at the same time.
For 2/, use the num_compute_units kernel attribute, so that multiple work-groups run at the same time.
Develop a single work-item kernel first, then use 1/ to improve the kernel performance; 2/ should generally be considered last.
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items, running at the same time on the FPGA device.
Note: depending on the nature of the problem you are solving, the above optimizations may not always be suitable.
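As an illustration of 1/, a minimal sketch (the attribute values and kernel are examples only, not from the question) combining num_simd_work_items with reqd_work_group_size:
/* Sketch: vectorize the kernel datapath 4 ways.
 * num_simd_work_items must evenly divide the required work-group size. */
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}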
If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel.
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the Programming Guide states that you should probably first look into using the simd attribute, which performs the same operation but over multiple data. That way, access to global memory becomes more efficient. If you increase the number of compute units and your kernels access global memory, there can be contention as multiple compute units compete for access to global memory.
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for (short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This will essentially perform the element-wise summation of the vectors N elements at a time, by creating hardware that does the addition N times in one go. If the data depends on the previous iteration, the unrolled copies are pipelined rather than executed fully in parallel.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by creating your kernels in one OpenCL file. When you compile the kernels, they will be mapped, placed and routed onto the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.
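A hedged host-side sketch of that last point (the kernel and queue names are made up for illustration): build both kernels from the same program, then enqueue each on its own command queue so they can run side by side.
/* Sketch: launch two different kernels concurrently on the FPGA.
 * Assumes kernel_a / kernel_b come from the same compiled program (.aocx),
 * queue_a / queue_b were created on the same device, and kernel arguments
 * have already been set with clSetKernelArg. */
cl_int err;
cl_kernel kernel_a = clCreateKernel(program, "kernel_a", &err);
cl_kernel kernel_b = clCreateKernel(program, "kernel_b", &err);
clEnqueueTask(queue_a, kernel_a, 0, NULL, NULL);   /* single work-item kernel */
clEnqueueNDRangeKernel(queue_b, kernel_b, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
clFinish(queue_a);
clFinish(queue_b);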

CPU SIMD vs GPU SIMD?

GPUs use the SIMD paradigm, that is, the same portion of code is executed in parallel and applied to various elements of a data set.
However, CPUs also use SIMD and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions process data elements in parallel.
While the SIMD paradigm seems to be used differently in GPUs and CPUs, do GPUs have more SIMD power than CPUs?
In what way are the parallel computational capabilities of a CPU 'weaker' than those of a GPU?
Both CPUs & GPUs provide SIMD with the most standard conceptual unit being 16 bytes/128 bits; for example a Vector of 4 floats (x,y,z,w).
Simplifying:
CPUs then parallelize further by pipelining future instructions so they proceed faster through a program. The next step is multiple cores, which run independent programs.
GPUs, on the other hand, parallelize by continuing the SIMD approach and executing the same program multiple times; both by pure SIMD, where a set of programs execute in lock step (which is why branching is bad on a GPU: both sides of an if statement must execute and one result is thrown away, so that the lock-step programs proceed at the same rate); and also by single program, multiple data (SPMD), where groups of these sets of identical programs proceed in parallel but not necessarily in lock step.
The GPU approach is great where the exact same processing needs to be applied to large volumes of data; for example a million vertices that need to be transformed in the same way, or many millions of pixels that need processing to produce their colour. Assuming they don't become data-blocked or pipeline-stalled, GPU programs generally offer more predictable, time-bound execution due to their restrictions; which again is good for temporal parallelism, e.g. when the programs need to repeat their cycle at a certain rate, for example 60 times a second (16 ms) for 60 fps.
The CPU approach however is better for decisioning and performing multiple different tasks at the same time and dealing with changing inputs and requests.
Apart from its many other uses and purposes, the CPU is used to orchestrate work for the GPU to perform.
It's a similar idea, it goes kind of like this (very informally speaking):
The CPU has a set of functions that can run on packed values. Depending on the brand and version of your CPU, you might have access to SSE2, 3, 4, 3DNow!, etc., and each of them gives you access to more and more functions. You're limited by the register size, and the larger the data types you work with, the fewer values you can use in parallel. You can freely mix and match SIMD instructions with traditional x86/x64 instructions.
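For example, a minimal sketch of CPU SIMD with SSE intrinsics mixed with ordinary scalar C (generic illustration code, not from the answer):
#include <xmmintrin.h>  /* SSE: 128-bit registers holding 4 packed floats */

void add4(const float *a, const float *b, float *out)
{
    /* One SIMD instruction adds 4 floats at once; with doubles the same
     * 128-bit register would only hold 2 values. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));

    /* Scalar x86 code can be mixed in freely. */
    out[0] += 1.0f;
}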
The GPU lets you write your entire pipeline for each pixel of a texture. The texture size doesn't depend on your pipeline length, i.e. the number of values you can affect in one cycle isn't dependent on anything but your GPU, and the functions you can chain (your pixel shader) can be pretty much anything. It's somewhat more rigid, though, in that the setup and readback of your values is somewhat slower, and it's a one-shot process (load values, run shader, read values); you can't massage them at all besides that, so you actually need to use a lot of values for it to be worth it.

Does OpenCL workgroup size matter in the OS X CPU runtime?

In the OS X OpenCL CPU runtime, the documentation here indicates that "Work items are scheduled in different tasks submitted to Grand Central Dispatch". That would seem to indicate that workgroups are essentially a no-op, and you should shoot for (number of work items) = (number of hardware threads) with (number of workgroups) being irrelevant. However, on other implementations, there are low-cost switches between items in the same workgroup via essentially coroutines (setjmp and longjmp), which would make it much less expensive to schedule more work-items (since you avoid a full OS-managed thread context switch between items), which in turn would make it easier to reuse code between CPU and GPU targets. According to "Heterogeneous Computing with OpenCL", AMD's CPU runtime does this, and I vaguely recall some documentation indicating the same is true for Intel's CPU runtime.
Can anyone confirm the behavior of workgroups in the OS X CPU runtime?
As stated later in the document (see the Autovectorizer section), workgroup size on CPU is linked to autovectorized code.
The autovectorizer aggregates several consecutive work-items into a single kernel function calling vector instructions (SSE, AVX) as much as possible.
Setting the workgroup size to 1 disables the autovectorizer. Larger values will enable vector code when available. In most cases, the generated code is able to efficiently use all the CPU resources.
In all cases, OpenCL on CPU runs on a small number of hardware threads.
Update: to answer the question in the comments.
It usually works quite well. Start with a "scalar" kernel and benchmark it to see the speedup provided by the autovectorizer, then "hand-vectorize" only if the speedup is not good enough. To help the compiler, avoid using "if", and prefer conditional assignments and bit operations.
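For example, a small OpenCL C sketch (illustrative only; the kernel is not from the answer) of replacing a branch with a conditional assignment that the autovectorizer can turn into vector instructions:
__kernel void threshold(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    /* Branchy version the autovectorizer may struggle with:
     *   if (in[i] > 0.0f) out[i] = in[i]; else out[i] = 0.0f;
     * Branchless conditional assignment instead: */
    out[i] = (in[i] > 0.0f) ? in[i] : 0.0f;
}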

Does modern GPU (e.g Fermi/Evergreen) supports out of order execution?

I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on a CPU (8-core FX-8150) and the results show these barriers reduce running speed by a factor of 50~100 times (I further verified this by re-implementing the kernel in Java using multi-threading + CyclicBarrier). I suspect the reason is that the barrier essentially stops the CPU from taking advantage of out-of-order execution, so I am a little worried whether I would observe the same magnitude of slowdown on a GPU. I checked a few official documents and googled around a bit, but there is little information available on this topic.
Current state-of-the-art GPUs are in-order pipelined processors. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparison, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units, like ALUs and SFUs, which have separate pipelines. But notice that instruction dependency stalls the warp. For more information on instruction dependency resolution on GPUs, refer to this NVIDIA patent.
According to the whitepaper NVIDIA's Next Generation CUDA Compute and Graphics Architecture, Code-Named "Fermi", the NVIDIA GigaThread Engine has the following capabilities (page 5):
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
Evergreen has SIMD capabilities and has a chance to outperform some Fermi parts, but I don't know about its out-of-order capabilities. There is also the "local atomic add" advantage of the HD 7000 series compared to the GTX 600 series (nearly 10x faster).
