Exploiting the GPU using Halide

I'm implementing an algorithm in Halide and comparing it against a hand-tuned CUDA version of the same algorithm.
Accelerating the Halide implementation mostly went well, but it is still a bit slower than the hand-tuned version. So I looked at the exact execution time of each Func using nvvp (the NVIDIA Visual Profiler). Doing that, I found that the hand-tuned implementation overlaps the execution of several similar functions, each of which is implemented as a Func in the Halide version. CUDA streams are used to achieve this overlap.
I would like to know whether I can exploit the GPU in a similar way from Halide.
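For context, the kind of overlap the hand-tuned version gets from streams looks roughly like this (a minimal CUDA sketch; the kernel names and bodies are placeholders for the similar functions mentioned above):

#include <cuda_runtime.h>

// Two independent, similar kernels standing in for the overlapping functions.
__global__ void kernelA(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] * 2.0f;
}

__global__ void kernelB(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels launched on different streams may overlap on the GPU.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}

With each kernel on its own stream and no dependencies between the buffers, the driver is free to run them concurrently when the GPU has spare resources.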
Thanks for reading.

Currently the runtime has no support for CUDA streams. It might be possible to replace the runtime with something that can do this, but no extra information is passed in to control the concurrency. (The runtime is somewhat designed to be replaceable, but there is something of a notion of a single queue, and full dependency information is not passed down. It may be possible to reconstruct the dependencies from the inputs and outputs, but that starts to be a lot of work to solve a problem the compiler should be solving itself.)
We're talking about how to express such control in the schedule. One possibility is to use the support being prototyped in the async branch, but we haven't fully figured out how to apply this to GPUs. (The basic idea is that scheduling a Func async on a GPU would put it on a different stream. We'd need to use GPU synchronization APIs to handle producer/consumer dependencies.) Ultimately this is something we are interested in exploiting, but work needs to be done.
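To make that concrete, here is a hedged sketch of what the schedule side might look like. The pipeline below is written against recent Halide releases (older ones spell gpu_tile differently); today both producers still go onto the runtime's single queue, and the async-style directive in the comment is the hypothetical piece discussed above, not an existing GPU feature:

#include "Halide.h"
using namespace Halide;

int main() {
    Func producer1("producer1"), producer2("producer2"), consumer("consumer");
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // Two independent, similar stages feeding one consumer.
    producer1(x, y) = x + y;
    producer2(x, y) = x * y;
    consumer(x, y) = producer1(x, y) + producer2(x, y);

    // Today: each producer becomes its own GPU kernel, but all launches share
    // the single queue managed by the Halide runtime.
    producer1.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    producer2.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    consumer.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Hypothetical (async branch idea): marking a producer async would be the
    // natural place to hang "run on its own stream" semantics, e.g.
    //   producer1.compute_root().async().gpu_tile(...);

    consumer.compile_jit(get_host_target().with_feature(Target::CUDA));
    return 0;
}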

Related

Alternative to dynamic parallelism for CUDA

I am very new to the CUDA programming model, and to programming in general, I suppose. I'm attempting to parallelize an expectation-maximization algorithm. I am working on a GTX 480, which has compute capability 2.0. At first I sort of assumed that there was no reason for the device to launch its own threads, but of course I was sadly mistaken. I came across this PDF:
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive option. I'm not sure if my question makes sense, so please ask if you need clarification.
Thank you!
As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism on a device that does not support it (i.e. pre-cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to realize some unit of your code as parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled by looping in the kernel, or else by looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
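In code, the structure without dynamic parallelism boils down to a host-side loop around kernel launches, roughly like this (a minimal sketch; the kernel body, names, and iteration count are placeholders):

#include <cuda_runtime.h>

// Placeholder kernel standing in for one parallelizable unit of the nested work.
__global__ void inner_step(float *data, int n, int iter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += (float)(iter + 1);   // stand-in for the real computation
}

void run_without_dynamic_parallelism(float *d_data, int n, int iterations) {
    for (int iter = 0; iter < iterations; ++iter) {
        // What would have been a device-side launch under dynamic parallelism
        // becomes a host-side launch inside a CPU loop instead.
        inner_step<<<(n + 255) / 256, 256>>>(d_data, n, iter);
    }
    cudaDeviceSynchronize();
}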
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
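A minimal single-kernel sketch of that structure, assuming each block owns an independent chunk of the data (since __syncthreads() only synchronizes within a block) and that the block size is a power of two; the arithmetic is a placeholder for the real EM math:

__global__ void em_iteration(const float *data, float *block_param, int n) {
    extern __shared__ float partial[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    // Expectation step: each thread contributes, then a block-level reduction.
    partial[t] = (i < n) ? data[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }

    // Maximization step: thread 0 of each block updates that block's parameter.
    if (t == 0) block_param[blockIdx.x] = partial[0] / (float)blockDim.x;
    __syncthreads();
}

Launch it with the dynamic shared memory size set to the block size, e.g. em_iteration<<<blocks, threads, threads * sizeof(float)>>>(d_data, d_params, n), and loop over EM iterations on the host.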
If that doesn't work and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what the architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also look at OpenMP and MPI; these are the tools used for task-parallel programming (actually, OpenMP is just a handful of pragmas in most cases).

Performance optimization in CUDA - Which of these algorithms should I use?

I have an algorithm which consists of two major tasks. Both tasks are embarrassingly parallel, so I can port this algorithm to CUDA in one of the following ways.
Kernel1<<<Block, Threads>>>();   // For task 1
cudaThreadSynchronize();
Kernel2<<<Block, Threads>>>();   // For task 2
Or I can do the following:
Kernel<<<Block, Threads>>>()
{
    1. Threads work on task 1.
    2. Synchronize across the device.
    3. Start on task 2.
}
One can note that with the first method we have to come back to the CPU, while with the second we have to synchronize across all blocks in CUDA. A paper at IPDPS '10 says that the second method, with proper care, can perform better. But in general, which method should be followed?
There is not currently any officially supported method for synchronizing across thread blocks within a single kernel execution in the CUDA programming model. Methods of doing so, in my experience, lead to brittle code that can behave incorrectly under changing circumstances such as running on different hardware, changing driver and CUDA release versions, etc.
Just because something is published in an academic publication does not mean it is a safe idea for production code.
I recommend you stick with your method 1, and I ask you this: have you determined that separating your computation into two separate kernels is really causing a performance problem? Is the cost of a second kernel launch definitely the bottleneck?
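One way to answer that question is to time the individual launches with CUDA events; a minimal sketch, reusing the names from the pseudocode above (the kernels here are empty placeholders and the launch configuration is made up):

#include <cstdio>
#include <cuda_runtime.h>

// Empty placeholders for task 1 and task 2.
__global__ void Kernel1() {}
__global__ void Kernel2() {}

int main() {
    dim3 Block(64), Threads(256);   // placeholder launch configuration

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    Kernel1<<<Block, Threads>>>();   // task 1
    cudaEventRecord(mid);
    Kernel2<<<Block, Threads>>>();   // task 2
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float t1 = 0.0f, t2 = 0.0f;
    cudaEventElapsedTime(&t1, start, mid);
    cudaEventElapsedTime(&t2, mid, stop);
    printf("task 1: %.3f ms, task 2 plus launch overhead: %.3f ms\n", t1, t2);
    return 0;
}

If the second launch adds only microseconds, the extra kernel is not your bottleneck.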

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps:
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed with the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
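For example, the accumulate above could be expressed with TBB's parallel_reduce along these lines (a minimal sketch; the element type and the plain sum are placeholders for the real reduction):

#include <cstddef>
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Parallel counterpart of std::accumulate(v.begin(), v.end(), 0.0).
double parallel_sum(const std::vector<double> &v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0.0,
        [&](const tbb::blocked_range<std::size_t> &r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) acc += v[i];
            return acc;
        },
        std::plus<double>());
}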
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
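The same reduction with OpenMP might look like this (a minimal sketch; compile with -fopenmp):

#include <vector>

double openmp_sum(const std::vector<double> &v) {
    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines them.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)v.size(); ++i) {
        sum += v[i];
    }
    return sum;
}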
A third alternative is to use Grand Central Dispatch, the official multicore API on OS X, though again it is hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module.
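Once that lands, the accumulate above is expected to look roughly like this (a hedged sketch based on the C++17 parallel algorithms; it needs a standard library that actually ships <execution>):

#include <execution>
#include <numeric>
#include <vector>

double parallel_accumulate(const std::vector<double> &v) {
    // std::reduce with a parallel execution policy is the C++17 counterpart
    // of the sequential std::accumulate call.
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}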

Fastest math programming language?

I have an application that requires millions of subtractions and remainders. I originally programmed this algorithm in C#.NET, but it takes five minutes to process the data and I need it faster than that.
I have considered Perl, and that seems to be the best alternative right now. VB.NET was slower in testing. C++ may be better as well. Any advice would be greatly appreciated.
You need a compiled language like Fortran, C, or C++. Other languages are designed to give you flexibility, object-orientation, or other advantages, and assume that absolute top performance is not your highest priority.
Know how to get maximum performance out of a single thread, and after you have done so investigate sharing the work across multiple cores, for example with MPI. To get maximum performance in a single thread, one thing I do is single-step it at the machine instruction level, to make sure it's not dawdling about in stuff that could be removed.
Some calculations are regular enough to benefit from GPGPUs: recent graphics cards are essentially specialized, massively parallel numerical co-processors. For instance, you could code your numerical kernels in OpenCL. Otherwise, learn C++11 (not some earlier version of the C++ standard) or C. In many cases OCaml can be nearly as fast as C++ but much easier to code in.
Perhaps your problem can be handled by Scilab or R; I did not understand it well enough to help more.
And you might take advantage of your multi-core processor, e.g. by using Pthreads or MPI.
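For instance, a subtraction-and-remainder pass can be split across cores along these lines (a minimal sketch using std::thread rather than raw Pthreads for brevity; the constants 12345 and 97 stand in for the real arithmetic):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Placeholder per-element work: one subtraction and one remainder.
void process_chunk(std::vector<std::int64_t> &data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        data[i] = (data[i] - 12345) % 97;
    }
}

// Split the array into one contiguous chunk per hardware thread.
void process_parallel(std::vector<std::int64_t> &data) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(data.size(), begin + chunk);
        if (begin < end)
            workers.emplace_back(process_chunk, std::ref(data), begin, end);
    }
    for (auto &w : workers) w.join();
}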
Finally, the Linux operating system is perhaps better suited to massive calculations. It is significant that most supercomputers use it today.
If execution speed is the highest priority, that usually means Fortran.
Try Julia: its killer feature is that it is easy to code in a high-level, concise way, while keeping performance on the same order of magnitude as Fortran/C.
PARI/GP is the best I have used so far. It's written in C.
Take a look at the DMelt mathematical program. The program calls Java libraries, and the Java virtual machine can optimize long mathematical calculations for you.
The standard tool for mathematical numerical operations in engineering is often MATLAB (or, as free alternatives, Octave or the already-mentioned Scilab).

When not to use MPI

This is not a question about a specific technical coding aspect of MPI. I am new to MPI and don't want to make a fool of myself by using the library in the wrong way, hence posting the question here.
As far as I understand, MPI is an environment for building parallel applications on a distributed-memory model.
I have a system interconnected with InfiniBand, for the sole purpose of doing some very time-consuming operations. I've already broken the algorithm up to run in parallel, so I am really only using MPI to transmit data (results of the intermediate steps) between multiple nodes over InfiniBand, which I believe one could simply use OpenIB to do.
Am I using MPI the right way? Or am I bending the original intention of the system?
It's fine to use just MPI_Send and MPI_Recv in your algorithm. As your algorithm evolves and you gain more experience, you may find uses for the more "advanced" MPI features such as barriers and collective communication (Gather, Reduce, etc.).
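For reference, the point-to-point pattern that covers this use case looks roughly like this (a minimal sketch; the buffer size, ranks, and tag are placeholders):

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    std::vector<double> buf(n, 0.0);

    if (rank == 0) {
        // ... compute intermediate results into buf ...
        MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // ... continue the next stage of the algorithm with buf ...
    }

    MPI_Finalize();
    return 0;
}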
The fewer and simpler the MPI constructs you need to use to get your work done, the better a match MPI is to your problem -- you can say that about most libraries and languages, as a practical matter and arguably as a matter of abstraction.
Yes, you could write raw OpenIB calls to do your work too, but what happens when you need to move to an ethernet cluster, or huge shared-memory machine, or whatever the next big interconnect is? MPI is middleware, and as such, one of its big selling points is that you don't have to spend time writing network-level code.
At the other end of the complexity spectrum, the time not to use MPI is when your problem or solution technique presents enough dynamism that MPI usage (most specifically, its process model) is a hindrance. A system like Charm++ (disclosure: I'm a developer of Charm++) lets you do problem decomposition in terms of finer grained units, and its runtime system manages the distribution of those units to processors to ensure load balance, and keeps track of where they are to direct communication appropriately.
Another not-uncommon issue is dynamic data access patterns, where something like Global Arrays or a PGAS language would be much easier to code.
