OpenCL - Performance - performance

I'm working with OpenCL on a matrix whose values I increment, and I need the run time to be as low as possible. What is the best way to improve performance with OpenCL? I've read a little about data parallelism and task parallelism, but I don't understand them well.
The matrix is 64x56. Using task parallelism, I have created 64 kernel functions, one for each column, but I think I could do much better.

If you are executing the kernel on a GPU, it might be better to have one thread handle one item. However, it depends on what exactly you are doing with the elements of the matrix, e.g. how many operations you perform on each of them.
If you only add a number to each element, there may be too little work per element for this to pay off.
In general, there are 3 options:
1) One thread works with the whole matrix. This way there is no parallelism, which is bad for a GPU.
2) One thread works with one row/column. 64 or 56 threads are used, and the global work size equals 64 or 56.
3) One thread works with a single element. 3584 threads are used, and the global work size is {64, 56}.
Have you tried using just one kernel that handles one element, and calling clEnqueueNDRangeKernel for it with the global work size equal to {64, 56}? How does it affect the execution time?
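To make the third option concrete, here is a minimal CPU-side sketch in plain C++ (not OpenCL code) of what a launch with global work size {64, 56} amounts to. The names increment_element and launch_2d are made up for illustration; in a real kernel the row and column would come from get_global_id(0) and get_global_id(1), and the work-items would run independently rather than in a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel body: one call handles one element,
// the way a work-item would after reading its two global IDs.
void increment_element(std::vector<float>& m, std::size_t cols,
                       std::size_t row, std::size_t col, float delta) {
    m[row * cols + col] += delta;
}

// Emulates a launch with global work size {rows, cols}. On a GPU the
// iterations would be independent work-items; here they run in sequence.
void launch_2d(std::vector<float>& m, std::size_t rows, std::size_t cols,
               float delta) {
    assert(m.size() == rows * cols);
    for (std::size_t row = 0; row < rows; ++row)
        for (std::size_t col = 0; col < cols; ++col)
            increment_element(m, cols, row, col, delta);
}
```

For the 64x56 matrix this makes 3584 calls, one per element, which is the amount of parallel work a single kernel launch would expose to the device.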

Related

CPU SIMD vs GPU SIMD?

A GPU uses the SIMD paradigm, that is, the same portion of code is executed in parallel and applied to various elements of a data set.
However, a CPU also uses SIMD and provides instruction-level parallelism. For example, as far as I know, SSE-like instructions process data elements in parallel.
While the SIMD paradigm seems to be used differently in GPUs and CPUs, do GPUs have more SIMD power than CPUs?
In which way are the parallel computational capabilities of a CPU 'weaker' than those of a GPU?
Both CPUs and GPUs provide SIMD, with the most standard conceptual unit being 16 bytes/128 bits; for example, a vector of 4 floats (x, y, z, w).
Simplifying:
CPUs then parallelize further by pipelining future instructions so that they proceed faster through a program. The next step is multiple cores, which run independent programs.
GPUs, on the other hand, parallelize by continuing the SIMD approach and executing the same program multiple times. This happens both through pure SIMD, where a set of programs executes in lock step (which is why branching is bad on a GPU: both sides of an if statement must execute, and one result is thrown away, so that the lock-step programs proceed at the same rate), and through single program, multiple data (SPMD), where groups of these sets of identical programs proceed in parallel but not necessarily in lock step.
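The cost of branching in lock step can be sketched on the CPU. The C++ below is only an emulation under assumptions: the lane count of 4 matches the 4-float vector mentioned above, and the function name and the per-lane logic (x < 0 ? -x : x * 2) are invented for the example. Both sides are computed for every lane, and a per-lane mask decides which result survives.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed width: 4 floats in 128 bits
using Lanes = std::array<float, kLanes>;

// Per-lane logic being emulated: x < 0 ? -x : x * 2.
// A lock-step machine evaluates both sides for all lanes, then a mask
// picks which result each lane keeps; the other result is thrown away.
Lanes branch_in_lockstep(const Lanes& x) {
    Lanes then_side{}, else_side{}, out{};
    for (std::size_t i = 0; i < kLanes; ++i) then_side[i] = -x[i];
    for (std::size_t i = 0; i < kLanes; ++i) else_side[i] = x[i] * 2.0f;
    for (std::size_t i = 0; i < kLanes; ++i) {
        const bool mask = x[i] < 0.0f;
        out[i] = mask ? then_side[i] : else_side[i];
    }
    return out;
}
```

The work done is the sum of both sides regardless of how many lanes take each one, which is the penalty described above.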
The GPU approach is great where the exact same processing needs to be applied to large volumes of data; for example, a million vertices that need to be transformed in the same way, or many millions of pixels that need processing to produce their colour. Assuming they don't stall on data blocks or the pipeline, GPU programs generally offer more predictable, time-bounded execution because of their restrictions, which again is good for temporal parallelism, e.g. when programs need to repeat their cycle at a certain rate, such as 60 times a second (about 16 ms per frame) for 60 fps.
The CPU approach, however, is better for decision making, performing multiple different tasks at the same time, and dealing with changing inputs and requests.
Apart from its many other uses and purposes, the CPU is used to orchestrate work for the GPU to perform.
It's a similar idea; it goes something like this (very informally speaking):
The CPU has a fixed set of functions that can run on packed values. Depending on the brand and version of your CPU, you might have access to SSE2, 3, 4, 3DNow!, etc., and each of them gives you access to more functions. You're limited by the register size, and the larger the data types you work with, the fewer values you can process in parallel. You can freely mix and match SIMD instructions with traditional x86/x64 instructions.
The GPU lets you write your entire pipeline for each pixel of a texture. The texture size doesn't depend on your pipeline length, i.e. the number of values you can affect in one cycle isn't dependent on anything but your GPU, and the functions you can chain (your pixel shader) can be pretty much anything. It's somewhat more rigid, though, in that the setup and readback of your values is somewhat slower, and it's a one-shot process (load values, run shader, read values). You can't massage the values at all beyond that, so you actually need to use a lot of values for it to be worth it.

How to ensure that my workitems are running parallel?

CL_DEVICE_NAME = GeForce GT 630
CL_DEVICE_TYPE = CL_DEVICE_TYPE_GPU
CL_PLATFORM_NAME : NVIDIA CUDA
size_t global_item_size = 8;
size_t local_item_size = 1;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
Printing from the kernel is not allowed here. How, then, can I make sure that all 8 of my cores are running in parallel?
Extra info (regarding my question): I pass the kernel an input array and an output array of size 8x8 as buffers. Each work-item solves the row given by its work-item number and saves the result in the output buffer. After that, I read the result back.
When I run on the AMD platform SDK, I can add print statements to the kernel by enabling
#pragma OPENCL EXTENSION cl_amd_printf : enable
With that, I can clearly see that on a 4-core machine the first 4 work-items run in parallel and then the rest run in parallel, which shows that at most 4 are solved in parallel.
But how can I see the same for my CL_DEVICE_TYPE_GPU?
Any help/pointers/suggestions will be appreciated.
Using printf is not at all a reliable method of determining if your code is actually executing in parallel. You could have 4 threads running concurrently on a single core for example, and would still have your printf statements output in a non-deterministic order as the CPU time-slices between them. In fact, section 6.12.13.1 of the OpenCL 1.2 specification ("printf output synchronization") explicitly states that there are no guarantees about the order in which the output is written.
It sounds like what you are really after is a metric that will tell you how well your device is being utilised, which is different than determining if certain work-items are actually executing in parallel. The best way to do this would be to use a profiler, which would usually contain such a metric. Unfortunately NVIDIA's NVVP no longer works with OpenCL, so this doesn't really help you.
On NVIDIA hardware, work-items within a work-group are batched up into groups of 32, known as a warp. Each warp executes in a SIMD fashion, so the 32 work-items in the warp execute in lockstep. You will typically have many warps resident on each compute unit, potentially from multiple work-groups. The compute unit will transparently context switch between these warps as necessary to keep the processing elements busy when warps stall.
Your brief code snippet indicates that you are asking for 8 work-items with a work-group size of 1. I don't know if this is just an example, but if it isn't, this will almost certainly deliver fairly poor performance on the GPU. As per the above, you really want the work-group size to be a multiple of 32, so that the GPU can fill each warp. Additionally, you'll want hundreds of work-items in your global size (NDRange) in order to properly fill the GPU. Running such a small problem size isn't going to be very indicative of how well your GPU can perform.
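When the problem size is not itself a multiple of the work-group size, one common approach is to round the global size up and have the kernel ignore the extra work-items (in OpenCL 1.x the global size has to be evenly divisible by the local size). A small sketch of that arithmetic in C++; the helper name round_up_global_size is an assumption for illustration, not an OpenCL function.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of local_size that is >= work_items.
std::size_t round_up_global_size(std::size_t work_items,
                                 std::size_t local_size) {
    assert(local_size > 0);
    return (work_items + local_size - 1) / local_size * local_size;
}
```

The result would be passed as the global size to clEnqueueNDRangeKernel with a local size of 32, and the kernel would return early for any global ID at or beyond the real problem size.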
If you are enqueueing enough work items (at least 32 but ideally thousands) then your "workitems are running parallel".
You can see details of how your kernel is executing by using a profiling tool, for example Parallel Nsight on NVIDIA hardware or CodeXL on AMD hardware. It will tell you things about hardware occupancy and execution speed. You'll also be able to see memory transfers.

My program uses only 25% of CPU power

My single-threaded program uses only 25% of a CPU with 2 cores (Intel i5-3210M). Why not 50% (one core)? The program is being tested on a MacBook Pro running 64-bit Windows 7. I think the problem is hyper-threading: because of it, the program uses only one logical core (25% of CPU power). How can I give my program more CPU power?
This is important to me because the program works with a big set of data and takes about 30 hours to finish its calculations.
That is to be expected, as you said, with your CPU (which has 4 logical processors). You can look for ways to transform your program so that it uses more than one thread. I recommend searching for "parallel programming", "concurrent programming" and "multi-threading". If you are using MS VC++, the PPL library is very easy to use. OpenMP is a more powerful tool, which is also available on Linux. There are many more approaches and libraries for this, but you need to choose one according to your OS, compiler, environment, programming language and problem.
However, the easiest solution is to run it on a desktop machine with a better CPU and cross your fingers to get the results as quickly as possible.
This program uses only one logical core (25% of cpu power). How can I give more CPU power to my program? ...this program works with big set of data ... it takes about 30 hours to finish calculations.
Divide up your data set into (at least) 4 separate pieces. With that much data, you want to think in terms of indexes into the data instead of copying data elements to 4 separate structures. Create a separate thread for each segment of your data, and have that thread only process one segment. You may need to set a processor affinity for your threads.
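A minimal sketch of the index-based split, assuming the per-item work can be expressed as a sum (the real calculation is not shown in the question, so parallel_sum and its summing loop are placeholders). Each thread receives a [begin, end) index range into the shared data and writes only to its own result slot, so no data is copied and no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` with `num_threads` workers. Each worker gets a contiguous
// index range [begin, end) and writes only to its own slot in `partial`.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long sum = 0;  // local accumulator, written back once
            for (std::size_t i = begin; i < end; ++i) sum += data[i];
            partial[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling parallel_sum(data, 4) would match the 4 logical processors mentioned in the question; std::thread::hardware_concurrency() is one way to query that number at run time. Processor affinity is not set here.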
If the data streams, or must be processed in order, think in terms of queuing elements for processing, where individual threads then dequeue and process each item. This works well when the enqueue operation is relatively fast compared to processing an item and can be done by a single master thread, while each dequeue/processing operation is more expensive.
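The queue variant could look roughly like this in C++. It is a sketch under assumptions: the name process_with_queue is invented, squaring an item stands in for the expensive per-item work, and results are simply added to a total, so the order in which items finish does not matter here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A master thread enqueues items; worker threads dequeue and process them.
long long process_with_queue(const std::vector<int>& items,
                             std::size_t num_workers) {
    assert(num_workers > 0);
    std::queue<int> pending;
    std::mutex m;
    std::condition_variable ready;
    bool done = false;  // set once the master has queued everything
    long long total = 0;

    auto worker = [&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            ready.wait(lock, [&] { return done || !pending.empty(); });
            if (pending.empty()) return;  // finished and nothing left
            const int item = pending.front();
            pending.pop();
            lock.unlock();  // do the expensive part without the lock
            const long long result = static_cast<long long>(item) * item;
            lock.lock();
            total += result;
        }
    };

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_workers; ++i) workers.emplace_back(worker);

    for (int item : items) {  // the master thread's cheap enqueue step
        {
            std::lock_guard<std::mutex> lock(m);
            pending.push(item);
        }
        ready.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    ready.notify_all();
    for (auto& w : workers) w.join();
    return total;
}
```

The lock is held only while touching the queue or the total, so the workers spend most of their time in the unlocked processing step, which is the situation the paragraph above describes as working well.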
Choosing the correct number of threads is tricky. Modern CPUs and operating systems are designed to switch tasks from time to time. This will always be an expensive operation, but the scheduler will want to do something else every so often, even if your process may seem like the best candidate. Therefore, you can often get the best throughput by overloading your CPUs to a small extent, so you may want two or three threads per logical CPU. One way to manage this is through the use of the ThreadPool object.

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is:
Forcing every thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices).
I couldn't use atomic operations because there are both a read and a write operation, so the threads could not be synchronized with __syncthreads().
Do any other ideas come to mind?
You may also want to use the reduction routines that come with CUDA Thrust, which is part of CUDA 4.0 or available here.
The library is written by a pair of NVIDIA engineers and compares favorably with heavily hand-optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
This is the usual way to perform reductions in CUDA.
Within each block:
1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (I personally favor n between 16 and 32) and updates its reduced value from them.
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof(datatype) bytes.
Since each block produces one reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You cannot coalesce the global memory writes, but that is a performance hit you have to take.
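The two steps and the repeated passes can be emulated on the CPU to check the arithmetic. This is plain C++, not a CUDA kernel: block_max and reduce_max are made-up names, a std::vector stands in for shared memory, and the loops over t run one after another where a GPU would run the threads of a block together. It assumes the thread count per block is a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One emulated block: each of `threads` threads folds up to `per_thread`
// values into a running maximum (step 1), then a tree reduction halves the
// active slots until slot 0 holds the block's maximum (step 2).
float block_max(const std::vector<float>& in, std::size_t begin,
                std::size_t threads, std::size_t per_thread) {
    assert(begin < in.size());
    assert(threads > 0 && (threads & (threads - 1)) == 0);  // power of two
    const std::size_t end = std::min(begin + threads * per_thread, in.size());
    std::vector<float> shared(threads, in[begin]);  // stands in for shared memory
    for (std::size_t t = 0; t < threads; ++t)
        // Stride by `threads` so neighbouring threads read neighbouring values.
        for (std::size_t i = begin + t; i < end; i += threads)
            shared[t] = std::max(shared[t], in[i]);
    for (std::size_t stride = threads / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            shared[t] = std::max(shared[t], shared[t + stride]);
    return shared[0];
}

// Repeats the block pass on the per-block results until one value is left.
float reduce_max(std::vector<float> values, std::size_t threads,
                 std::size_t per_thread) {
    const std::size_t per_block = threads * per_thread;
    assert(!values.empty() && per_block > 1);
    while (values.size() > 1) {
        std::vector<float> block_results;
        for (std::size_t b = 0; b < values.size(); b += per_block)
            block_results.push_back(block_max(values, b, threads, per_thread));
        values = std::move(block_results);
    }
    return values[0];
}
```

With 256 threads and 16 values per thread, 1 million elements give ceil(1,000,000 / 4096) = 245 first-pass blocks (the "around 250" above), and the 245 block results fit in a single block for the second pass.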
NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.
I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.
Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is laid out contiguously in memory). It's just a reduction over a sequence of values, namely all the matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days, as far as I can tell, is the excellent libcub by NVIDIA's Duane Merrill. Here is the documentation on its device-wide maximum-calculating function.
Note, though, that unless the matrix is small, most of the computation will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swath of the matrix (or rather, a large strided swath) will it write its local maximum anywhere, typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call only once every obscenely large number of matrix element reads: tens of thousands, if not more.
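The pattern of many private updates followed by one rare atomic update can be sketched with CPU threads. This is C++ rather than CUDA, and it skips the block-level shared-memory stage: the atomic_max helper is a compare-and-swap loop standing in for atomicMax(), and strided_max is a made-up name. Each thread scans its own strided swath and touches the shared value exactly once.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Raises `target` to `value` if `value` is larger; a CAS loop standing in
// for CUDA's atomicMax().
void atomic_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
        // `seen` was refreshed by the failed exchange; the loop re-checks it.
    }
}

int strided_max(const std::vector<int>& data, std::size_t num_threads) {
    assert(!data.empty() && num_threads > 0);
    std::atomic<int> global_max(data[0]);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &global_max, t, num_threads] {
            int local = data[0];  // thread-private running maximum
            for (std::size_t i = t; i < data.size(); i += num_threads)
                local = std::max(local, data[i]);
            atomic_max(global_max, local);  // the only shared write
        });
    }
    for (auto& w : workers) w.join();
    return global_max.load();
}
```

However large the input, the number of atomic operations equals the number of threads, which is why the atomic is not the bottleneck in this arrangement.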
The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
If you have a K20 or Titan, I suggest dynamic parallelism: launch a single-thread kernel, which launches #items worker kernel threads to produce the data, then launches #items/first-round-reduction-factor threads for the first round of reduction, and keeps launching until the result comes out.

CUDA: Bigger problems in threads

Almost all of the CUDA example code describes doing near-atomic operations on large data sets. What kind of practical limitations are there on the size of the problem each thread can handle?
For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?
CUDA is a data parallel programming model for what is effectively a SIMD architecture, so obviously it isn't as flexible as a general purpose multithreaded or MIMD architecture. That said, kernels can certainly be a lot more complex than simple arithmetic operations.
In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), in which every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
The key area to be mindful of is branch divergence. Because it is a SIMD architecture under the hood, code with a lot of branching within a warp of threads (which is effectively the SIMD width) will suffer performance penalties. But branch divergence and code complexity need not be synonymous: you can write very "branchy" and "loopy" code that runs well, as long as threads within any given warp don't diverge too often. In FLOP- and IOP-heavy algorithms, that is usually not too hard to achieve.
I just want to reiterate what talonmies said: there is no real limit to the "size" of a kernel in number of operations. As long as the computation is parallel, CUDA will be effective!
As far as practical considerations go, I would just add a few small notes:
Long-running kernels can time out, depending on the OS (or when profiling with cudaProf). You might have to change a setting somewhere to increase the maximum kernel execution time.
Long-running kernels on systems without a dedicated GPU can freeze the display (interrupting the UI).
Warps are executed asynchronously: one warp can access memory while another performs arithmetic, in order to use clock cycles effectively. Long-running kernels might benefit more from attention to this kind of optimization. I'm not really sure about this last one.

Resources