How to compare OpenCL with native code performance properly?

Intel provides some advice on comparing OpenCL with native code here.
Are there any additional recommendations? I am particularly interested in whether OpenCL is usually compared with straightforward C/C++ code or whether optimizations of the sequential code are also taken into account. What about intrinsic functions?

Related

What are the possibilities of combining CUDA, OpenCL and OpenACC in the same program?

Each language offers its own advantages and disadvantages, but what advantage would combining them all offer?
OpenACC apparently has some degree of interoperability with CUDA. OpenCL, on the other hand, has no way of working with either OpenACC or CUDA. So there is no way to do what you ask about, irrespective of the perceived benefits of being able to do so.
In general, use OpenACC for your high-level development and data management within standard C/C++ and Fortran. Then, if you need a higher degree of control over a kernel (i.e. if you think you can get better performance at the cost of losing some portability), you can code that kernel in the lower-level models of CUDA or OpenCL. But you can't really do all of them at the same time.
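As a rough illustration of that pattern, here is a minimal sketch in C, assuming an OpenACC compiler (e.g. PGI/NVHPC) and a hypothetical launch_saxpy() wrapper around a CUDA kernel compiled separately with nvcc; OpenACC owns the data movement, and host_data hands its device pointers to the hand-written CUDA code:

```c
/* Sketch only: launch_saxpy() is a hypothetical wrapper implemented in a .cu
   file and compiled with nvcc; this file is compiled with an OpenACC compiler
   and the two objects are linked together. */
void launch_saxpy(float a, const float *x, float *y, int n);

void saxpy(float a, float *x, float *y, int n)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Inside host_data, x and y refer to the device copies that
           OpenACC allocated, so the CUDA kernel can use them directly. */
        #pragma acc host_data use_device(x, y)
        launch_saxpy(a, x, y, n);
    }
}
```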

Scientific library in C/C++ with OpenMP

I have to exploit OpenMP in some algorithm, and for this purpose I need some mathematical functions such as eig or svd, as available in MATLAB, where they are quite fast. I have already tried the following libraries with OpenMP:
GSL - GNU Scientific Library
Eigen C++ template library
but I don't know why my OpenMP-parallelised code is much slower than the serial code. Maybe there is something wrong in the library, or the functions random, eig or svd are blocking? I have no idea how to figure it out. Can somebody suggest which math library is most compatible with OpenMP?
I can recommend Intel's MKL; note that it costs money which may affect your decision. I neither know nor care what language(s) it is written in, just so long as it provides APIs callable from my chosen language. Mine is Fortran, but it has bindings for C too
If you look around SO you'll find many questions from people whose first (or second or third) OpenMP programs were actually slower than their serial versions. Look at some of the answers. Don't conclude that there is a magic bullet, in the shape of a library, to make your code faster. Instead, realise that it is most likely that you've written a poorly-parallelised program and fix that.
Finally, if you have an installation of Matlab, don't expect to be able to write your own routines to outperform Matlab's. I won't say it can't be done, but I think you'll find it very difficult.
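To illustrate the "poorly-parallelised program" point with a generic sketch (not specific to GSL or Eigen): a loop whose iterations all update one shared variable will race unless OpenMP is asked for a reduction, and a loop whose body is too cheap will be dominated by threading overhead.

```cpp
// Generic sketch: sum of squares over a large vector. Without the
// reduction clause the threads would race on `sum` (wrong and slow);
// with it, each thread keeps a private partial sum that is combined
// at the end. The loop body must also do enough work per iteration
// to pay for the cost of spinning up the threads.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(10000000, 0.5);
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        sum += v[i] * v[i];

    std::printf("%f\n", sum);
    return 0;
}
```

Build with OpenMP enabled (e.g. g++ -O2 -fopenmp) and time it against the serial build before concluding that the library is at fault.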
GSL is compatible with OpenMP. You can also try the Intel Math Kernel Library, which is available as a free trial.
If the speed-up is not great, then the code is probably not very parallelizable. You may want to inspect the running threads in Intel Thread Checker; that could be helpful for seeing where the bottlenecks are.
I think you just want to find a fast implementation of lapack (or related routines) which is already threaded, but it's a little hard to tell from your question. High Performance Mark suggests MKL, which is an excellent example; others include ATLAS or FLAME which are open source but take some doing to build.

GPU Programming?

I'm new to the GPU Programming world, I've tried reading on Wikipedia and Googling, but I still have several questions:
I downloaded some GPU examples for CUDA. There were some .cu files and some .cpp files, but all the code was normal C/C++ code, apart from some unfamiliar functions like cudaMemcpyToSymbol; the rest was plain C code. The question is: is the .cu code compiled with nvcc and then linked with gcc, or how is it built?
If I write something to run on the GPU, will it run on ALL GPUs, or just CUDA-capable ones? Or is there one way to write for CUDA, another way to write for ATI, and a way to write for both?
To answer your second question:
OpenCL is the (only) way to go if you want to write platform-independent GPGPU code.
ATI's website actually has a lot of resources for OpenCL if you search a little, and their example projects are very easy to modify into what you need, or just to understand the code.
The OpenCL spec and reference pages are also very good sources of knowledge:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
There are also a lot of talks I would recommend that explain some of the core concepts and how to write fast code (most of which applies to CUDA too).
To almost answer your first question:
In OpenCL, the code is compiled at runtime to the specific GPU you're using (to guarantee speed).
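To make that concrete, here is a minimal host-side sketch (error handling and context/device setup omitted) of that runtime compilation: the kernel source is just a string that clBuildProgram compiles for whichever device you picked.

```c
#include <CL/cl.h>

/* Sketch only: the caller is assumed to have created the context and
   chosen a device. The kernel source is a plain string; clBuildProgram
   compiles it at run time for that specific device. */
static const char *source =
    "__kernel void scale(__global float *data, float factor) {\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] *= factor;\n"
    "}\n";

cl_kernel build_scale_kernel(cl_context context, cl_device_id device)
{
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    return clCreateKernel(program, "scale", NULL);
}
```

If the build fails, clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG is the usual way to get the compiler output.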
You probably want to do some background reading on CUDA - it's not something you can just pick up by looking at a few code samples. There are about 3 different CUDA books on Amazon now, and there is a lot of reference material at http://developer.nvidia.com.
To answer your questions:
Yes, .cu files are compiled with nvcc to an intermediate form (PTX); this is subsequently converted to GPU-specific code at run time (see the sketch below).
The generated code will run on a subset of NVIDIA GPUs; the size of that subset depends on which CUDA capabilities you use in your code.
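To put a rough shape on both points, here is a minimal .cu file and the usual build commands (file names are just placeholders); nvcc compiles the device code, and the result is linked like any other object file against the CUDA runtime:

```cpp
// kernel.cu -- built with, for example:
//   nvcc -c kernel.cu -o kernel.o
//   g++  -c main.cpp  -o main.o
//   g++  main.o kernel.o -lcudart -o app
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;          // one thread per element
}

void scale_on_gpu(float *host, float factor, int n)
{
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, factor, n);   // launch configuration
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

You can also let nvcc drive the whole link step (nvcc main.o kernel.o -o app) if you prefer.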
Completing the answer given by @nulvinge, I'd say that OpenCL is to GPU programming what OpenGL is to GPU rendering. But it's not the only option for multi-architecture development: you could also use DirectCompute, though I wouldn't say it's the best option, only worthwhile if you want your code running on every DirectX 11 compatible GPU, which includes some Intel graphics chips too, right?
But even if you are thinking of doing some GPU programming with OpenCL, do not forget to study the architecture of the platforms you're using. CPUs, ATI GPUs and NVIDIA GPUs have big differences, and your code needs to be tuned for each platform you're using if you want to get the most out of it...
Fortunately, both NVIDIA and AMD have programming guides to help you :)
In addition to the previous answers: for CUDA you would need an NVIDIA card/GPU, unless you have access to a remote one. For that I would recommend this course from Coursera:
Heterogeneous Parallel Programming
It not only gives an introduction to CUDA and OpenCL (memory model, tiling, handling boundary conditions and performance considerations), but also covers directive-based languages such as OpenACC, a high-level approach for expressing parallelism in your code that leaves most of the parallel programming work to the compiler (good to start with). The course also has an online platform where you can use their GPUs, which is a good way to start GPU programming without worrying about software/hardware setup.
If you want to write portable code that you can execute on different GPU devices and also on CPUs, you need to use OpenCL.
To configure your kernel you need to write host code in C. That configuration code tends to be shorter for CUDA kernels than for OpenCL ones (see the sketch below).
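As a rough, hedged comparison of the launch step alone (contexts, queues, programs and buffers are assumed to already exist, and scale is a hypothetical kernel taking a buffer and a float): in CUDA the launch configuration is a single line, while in OpenCL each argument is set explicitly and the kernel is enqueued on a command queue.

```cpp
#include <CL/cl.h>

// CUDA host side (in a .cu file) would be one line:
//     scale<<<(n + 255) / 256, 256>>>(dev_data, factor, n);

// OpenCL host side for the same launch:
void launch_scale_opencl(cl_command_queue queue, cl_kernel kernel,
                         cl_mem dev_data, float factor, size_t n)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_data);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
}
```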

Performance of Google's Go?

So, has anyone used Google's Go? I was wondering how its mathematical performance (e.g. flops) compares to other languages with a garbage collector... like Java or .NET?
Has anyone investigated this?
Theoretical performance: The theoretical performance of pure Go programs is somewhere between C/C++ and Java. This assumes an advanced optimizing compiler and it also assumes the programmer takes advantage of all features of the language (be it C, C++, Java or Go) and refactors the code to fit the programming language.
Practical performance (as of July 2011): The standard Go compiler (5g/6g/8g) is currently unable to generate efficient instruction streams for high-performance numerical codes, so the performance will be lower than C/C++ or Java. There are multiple reasons for this: each function call has an overhead of a couple of additional instructions (compared to C/C++ or Java), no function inlining, average-quality register allocation, average-quality garbage collector, limited ability to erase bound checks, no access to vector instructions from Go, compiler has no support for SSE2 on 32-bit x86 CPUs, etc.
Bottom line: As a rule of thumb, expect the performance of numerical codes implemented in pure Go, compiled by 5g/6g/8g, to be 2 times lower than C/C++ or Java. Expect the performance to get better in the future.
Practical performance (September 2013): Compared to the older Go from July 2011, Go 1.1.2 is capable of generating more efficient numerical code, but it still runs slightly slower than C/C++ and Java. The compiler utilizes SSE2 instructions even on 32-bit x86 CPUs, which makes 32-bit numerical code run much faster, most likely thanks to better register allocation. The compiler now implements function inlining and escape analysis. The garbage collector has also been improved, but it remains less advanced than Java's garbage collector. There is still no support for accessing vector instructions from Go.
Bottom line: The performance gap seems sufficiently small for Go to be an alternative to C/C++ and Java in numerical computing, unless the competing implementation is using vector instructions.
The Go math package is largely written in assembler for performance.
Benchmarks are often unreliable and are subject to interpretation. For example, Robert Hundt's paper Loop Recognition in C++/Java/Go/Scala looks flawed. The Go blog post on Profiling Go Programs dissects Hundt's claims.
You're actually asking several different questions. First of all, Go's math performance is going to be about as fast as anything else. Any language that compiles down to native code (which arguably includes even JIT languages like .NET) is going to perform extremely well at raw math -- as fast as the machine can go. Simple math operations are very easy to compile into a zero-overhead form. This is the area where compiled (including JIT) languages have an advantage over interpreted ones.
The other question you asked was about garbage collection. This is, to a certain extent, a bit of a side issue if you're talking about heavy math. That's not to say that GC doesn't impact performance -- actually it impacts quite a bit. But the common solution for tight loops is to avoid or minimize GC sweeps. This is often quite simple if you're doing a tight loop -- you just re-use your old variables instead of constantly allocating and discarding them. This can speed your code by several orders of magnitude.
As for the GC implementations themselves -- Go and .NET both use mark-and-sweep garbage collection. Microsoft has put a lot of focus and engineering into their GC engine, and I'm inclined to think that it's quite good all things considered. Go's GC engine is a work in progress, and while it doesn't feel any slower than .NET's architecture, the Golang folks insist that it needs some work. The fact that Go's specification disallows destructors goes a long way in speeding things up, which may be why it doesn't seem that slow.
Finally, in my own anecdotal experience, I've found Go to be extremely fast. I've written very simple and easy programs that have stood up in my own benchmarks against highly-optimized C code from some long-standing and well-respected open source projects that pride themselves on performance.
The catch is that not all Go code is going to be efficient, just like not all C code is efficient. You've got to build it correctly, which often means doing things differently than what you're used to from other languages. The profiling blog post mentioned here several times is a good example of that.
Google did a study comparing Go to some other popular languages (C++, Java, Scala). They concluded it was not as strong performance-wise:
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
Quote from the Conclusion, about Go:
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes.

GPU Programming

I want to do some GPU programming. What's the way to go here? I want to learn something that is "open", cross-platform, and a "higher-level" language. I don't want to be locked into just one GPU vendor, OS, platform, etc.
What are my choices here? Cuda, OpenCL, OpenMP, other? What's the pros/cons for them?
What about G/HLSL and PhysX?
I'm looking at doing "general purpose" programming, some math, number crunching, simulations, etc. Maybe spit out some pretty graphics, but not specifically graphics programming.
The answer marked correct is now outdated and incorrect. In particular OpenMP 4.0 supports GPU acceleration.
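For reference, here is a rough sketch of what OpenMP 4.0 offloading looks like (it needs a compiler and runtime built with offload support; without one, the directive simply runs the loop on the host):

```cpp
// Map the arrays to the accelerator, spread the loop across its compute
// units, and copy y back when the target region ends.
void saxpy(float a, const float *x, float *y, int n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```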
OpenMP is CPU-only but easy to implement; CUDA is basically GPU-only. ATI Stream supports both, but only on ATI/AMD GPUs. OpenCL is your only "open" option that supports both.
Nowadays (2013/2014) there is Microsoft's C++ Accelerated Massive Parallelism (AMP). This is a high-level language that compiles to High Level Shader Language (HLSL), so you do not have to write kernel code, etc.
'How to learn' is found here (click!)
An introduction video is found here (click!)
A simple and easy to read comparison between OpenCL and C++ AMP is done by the AMD folks and is found here (click!).
GPU support for OpenMP will be available in the near future:
http://openmp.org/sc14/Booth-Sam-IBM.pdf
If you want to get involved with GPUs at a higher level than OpenCL, you might have a look at MATLAB. It is possible to program GPUs via MATLAB, and you do not need to learn lower-level models such as OpenCL and CUDA.
CUDA will be more efficient, as you are probably going to program an NVIDIA card. However, OpenCL is the standard for GPGPU and the way you code is pretty similar. Although you might not find it very difficult to use CUDA or OpenCL, you will find it much harder to optimize them.
I hope it helps.
OpenCL is open, but I've heard that a downside to this is the lack of documentation. ATI might be the better choice between NVIDIA and ATI since it was reportedly faster in 2009, but I'm not sure if those stats are still accurate.

Resources