Optimizing n-body on a GPU cluster with OpenACC

We are trying to provide a generic n-body algorithm for multiple nodes.
A node has 2 GPUs and 1 CPU.
We want to calculate the n-body interactions only on the GPUs, using OpenACC. After doing some research on OpenACC I am unsure how to spread the calculation across multiple GPUs.
Is it possible to use 2 GPUs with only one thread and OpenACC?
If not, what would be a suitable approach: using OpenMP to drive both GPUs on one node
and communicating with other nodes via MPI?

The OpenACC runtime library provides routines (acc_set_device_num(), acc_get_device_num()) to select which accelerator device will be targeted by a particular thread, but it's not convenient for a single thread to drive multiple devices simultaneously. Instead, either OpenMP or MPI can be used.
For example (lifting from here) a basic framework for OpenMP might be:
#include <openacc.h>
#include <omp.h>

#pragma omp parallel num_threads(2)
{
    int i = omp_get_thread_num();              // thread id: 0 or 1
    acc_set_device_num(i, acc_device_nvidia);  // bind this thread to one GPU
    #pragma acc data copy...
    {
    }
}
It can also be done with MPI, and/or you could use MPI to communicate between nodes, as is typical.
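A minimal MPI sketch of the same idea (my own illustration, not from the original answer) launches two ranks per node and binds each rank to one GPU:

#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Assumes the launcher places two consecutive ranks on each node.
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);  // one GPU per rank
    /* ... OpenACC compute regions here target this rank's GPU ... */
    MPI_Finalize();
    return 0;
}

Each rank then exchanges particle data with the other ranks using ordinary MPI calls.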

Related

Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to an FPGA. I managed to execute the vector addition example provided by Altera and it seems to work fine. I've looked at the documentation for Altera OpenCL and learned that OpenCL uses pipelined parallelism to improve performance.
I was wondering if it is possible to achieve parallel execution similar to multiple VHDL processes executing in parallel, using Altera OpenCL on an FPGA. Like launching multiple kernels on one device that execute in parallel? Is it possible? How do I check if it is supported? Any help would be appreciated.
Thanks!
The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes; multiple work-items from the same work-group will then run at the same time.
For 2/, use the num_compute_units kernel attribute; multiple work-groups will then run at the same time.
Develop a single work-item kernel first, then use 1/ to improve the kernel's performance; 2/ should generally be considered last.
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items, running at the same time on the FPGA device (see the sketch below).
Note: depending on the nature of the problem you are solving, the above optimizations may not always be suitable.
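A hedged sketch of what these attributes look like on a made-up vector-add kernel (the SIMD factor must evenly divide the required work-group size):

// Hypothetical kernel for illustration only.
__attribute__((reqd_work_group_size(64, 1, 1)))  // required by num_simd_work_items
__attribute__((num_simd_work_items(4)))          // 1/: vectorise across 4 work-items
__attribute__((num_compute_units(2)))            // 2/: replicate the pipeline twice
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}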
If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel:
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the Programming Guide states that you should probably first look into using the SIMD attribute, where the same operation is performed over multiple data. That way, access to global memory becomes more efficient. If you increase the number of compute units and your kernels access global memory, there can be contention as multiple compute units compete for global memory.
You can also replicate operations at a finer-grained level by using loop unrolling. For example:
#pragma unroll N
for (short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This essentially performs the element-wise summing of a vector N times in one go by creating hardware to do the addition N times. If the data depends on the previous iteration, the compiler pipelines the iterations instead of executing them fully in parallel.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by creating several kernels in one OpenCL file. When you compile the kernels, the offline compiler will map, place, and route all kernels in the file onto the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.
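A minimal host-side sketch of that last step (kernel names hypothetical, error checking omitted); note that a single in-order queue serializes commands, so one command queue per kernel is used here for concurrency:

cl_kernel kernelA = clCreateKernel(program, "kernelA", NULL);
cl_kernel kernelB = clCreateKernel(program, "kernelB", NULL);
// ... set kernel arguments with clSetKernelArg ...
clEnqueueTask(queueA, kernelA, 0, NULL, NULL);  // single work-item kernel A
clEnqueueTask(queueB, kernelB, 0, NULL, NULL);  // kernel B runs alongside A
clFinish(queueA);
clFinish(queueB);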

Running parallel OpenCL kernels

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.
In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set which in some cases make it faster, or slower. My hope is that we can re-write the application for OpenCL and then execute each option/flag in parallel, rather than guessing which sets of flags to use.
My question is this:
How many kernels can a graphics card run in parallel? Is this something that can be looked at when purchasing? Is it linked to the number of shaders, memory, or the size of the application/kernel?
Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory?
Finally, would this even require multiple kernels, could I use work-items instead? In which case, how do you determine how many work-items can run in parallel?
(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)
Your question seems to pop up from time-to-time in various forums and on SO. The feature you would use to run kernels separately on a hardware level is called device fission. Read more about the extension on this page, or google "cl_ext_device_fission".
This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission; you probably need a GPU from at least Q2 2014 or newer, but this you will have to research yourself.
The way to get kernels to run in parallel using OpenCL software only is to queue them on different command queues on the same device. Some developers say that multiple queues hurt performance, but I don't have personal experience with that.
How many kernels can a graphics card run in parallel?
You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of SIMDs/CUs/shaders/etc. depending on what the GPU vendor likes to call them. It gets a little complicated to get an exact number of how many kernel instances really execute as this depends on the occupancy which depends on the resources the kernel uses, e.g. registers used, local memory used.
If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).
Can each kernel allocate "local" memory?
Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource and should be thought of as a kernel-controlled cache rather than a heap.
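For illustration, a tiny hypothetical kernel that stages data through local memory (assumes a work-group size of 64):

__kernel void scale(__global const float *in, __global float *out)
{
    __local float tile[64];            // shared by one work-group only
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    tile[lid] = in[gid];               // stage through local memory
    barrier(CLK_LOCAL_MEM_FENCE);      // wait until the tile is filled
    out[gid] = tile[lid] * 2.0f;
}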
In which case, how do you determine how many work-items can run in parallel?
Same answer as the first question assuming kernel instances.
Would this even require multiple kernels, could I use work-items instead?
You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on work item global id. For example:
void subKernel0( .... )
{
    int gid = get_global_id(0);
    // etc.
}

void subKernel1( .... )
{
    int gid = get_global_id(0) - DISPATCH_SIZE_0;
    // etc.
}

__kernel void uberKernel( .... )
{
    if( get_global_id(0) < DISPATCH_SIZE_0 )
    {
        subKernel0( .... );
    }
    else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
    {
        subKernel1( .... );
    }
    else if( .... )
    {
        // etc.
    }
}
The usual performance suggestion of making the dispatch sizes multiples of 32/64, etc. applies here too. You'll also have to adjust the various other ids accordingly.
For compatibility with hardware from roughly 2008 to 2015, it is safest to assume that every GPU can only run one kernel at any moment, and that kernels are swapped and compiled at runtime, then queued to emulate multiple kernels.
This swapping of kernels is why large kernels are better than tiny kernels.
Single-kernel compute units are the default.
Having the option to run 2 different, independent kernels at the same time is the exception. Assume it to be rare and either unsupported or slower.
Of course 2 CPUs in one computer can do that, but as of 2016 having 2 CPUs in one system is still a bit too uncommon, and having 4 even rarer.
Some graphics cards may be able to run 2 kernels in parallel. Assume they do not support such a thing.

What is the difference between a serial code and using the keyword critical while parallelizing a code in openmp?

If I have just one for loop to parallelize and if I use #pragma omp critical while parallelizing, will that make it equivalent to a serial code?
No.
The critical directive specifies that the code it covers is executed by one thread at a time, but it will (eventually) be executed by all threads that encounter it.
The single directive specifies that the code it covers will be executed by only one thread, but even that isn't exactly the same as compiling the code without OpenMP. OpenMP imposes some restrictions on the programming constructs that can be used inside parallel regions (e.g. no jumping out of them). Furthermore, at run time you are likely to incur an overhead for firing up OpenMP even if you don't actually run any code in parallel.
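A small illustration of the difference (my own example): with critical, every thread executes the block, just one at a time; with single, only one thread executes it at all.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp critical
        printf("critical: thread %d\n", omp_get_thread_num());  // printed 4 times

        #pragma omp single
        printf("single: thread %d\n", omp_get_thread_num());    // printed once
    }
    return 0;
}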

Splitting LAPACK calls with OpenMP

I am working on code-tuning a routine I have written and one part of it performs two matrix multiplications which could be done simultaneously. Currently, I call DGEMM (from Intel's MKL library) on the first DGEMM and then the second. This uses all 12 cores on my machine per call. I want it to perform both DGEMM routines at the same time, using 6 cores each. I get the feeling that this is a simple matter, but have not been able to find/understand how to achieve this. The main problem I have is that OpenMP must call the DGEMM routine from one thread, but be able to use 6 for each call. Which directive would work best for this situation? Would it require nested pragmas?
So as a more general note, how can I divide the (in my case 12) cores into sets which then run a routine from one thread which uses all threads in its set.
Thanks!
The closest thing that you can do is to have an OpenMP parallel region executing with a team of two threads and then call MKL from each thread. You have to enable nested parallelism in MKL (by disabling dynamic threads), fix the number of MKL threads to 6 and have to use Intel's compiler suite to compile your code. MKL itself is threaded using OpenMP but it's Intel's OpenMP runtime. If you happen to use another compiler, e.g. GCC, its OpenMP runtime might prove incompatible with Intel's.
As you haven't specified the language, I provide two examples - one in Fortran and one in C/C++:
Fortran:
call mkl_set_num_threads(6)
call mkl_set_dynamic(0)
!$omp parallel sections num_threads(2)
!$omp section
call dgemm(...)
!$omp end section
!$omp section
call dgemm(...)
!$omp end section
!$omp end parallel sections
C/C++:
mkl_set_num_threads(6);
mkl_set_dynamic(0);
#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    {
        cblas_dgemm(...);
    }
    #pragma omp section
    {
        cblas_dgemm(...);
    }
}
In general you cannot create subsets of threads for MKL (at least given my current understanding). Each DGEMM call would use the globally specified number of MKL threads. Note that MKL operations might be tuned for the cache size of the CPU, and performing two matrix multiplications in parallel might not be beneficial. You might benefit if you have a NUMA system with two hexa-core CPUs, each with its own memory controller (which I suspect is your case), but then you have to take care of where data is placed and also enable binding (pinning) of threads to cores.

What is easier to learn and debug, OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We have access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie with both MPI and OpenMP. I just wonder which is the easiest one to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started. The advantage is though that your program could run on several nodes at one time, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category: provided you have more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and an almost 100x speedup over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this with Condor which may be more sensible, and there are similar mechanisms for Torque, SGE, etc.)
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across the cores or multiple CPUs within a node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser grained, the speedup you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
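For the embarrassingly parallel case above, a hedged MPI sketch (process_data_set() is a hypothetical stand-in for the existing main-loop body) could distribute the data sets round-robin over the ranks:

#include <mpi.h>
#include <stdio.h>

// Hypothetical stand-in for the existing per-data-set computation.
static void process_data_set(int id)
{
    printf("processing data set %d\n", id);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int ndatasets = 100;               // assumed number of data sets
    // Static round-robin distribution: rank r handles sets r, r+size, ...
    for (int i = rank; i < ndatasets; i += size)
        process_data_set(i);
    MPI_Finalize();
    return 0;
}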
