Effect of Visual Studio compiler settings on performance of CUDA kernels - visual-studio-2010

I get about a 3-4x difference in computation time for the same CUDA kernel compiled on two different machines. Both versions run on the same machine and GPU device. The obvious conclusion is that the difference comes from different compiler settings. Although there is no single perfect setting and the tuning should be customized depending on the kernel, I wonder if there is any clear guideline to help choose the right settings. I use Visual Studio 2010. Thank you.

Compile in release mode, not debug mode, if you want the fastest performance. The -G switch passed to the nvcc compiler will usually have a negative effect on GPU code performance.
It's generally recommended to select the right architecture for the GPU you are compiling for. For example, if you have a cc 2.1 capability GPU, make sure that setting (sm_21, in GPU code settings) is being passed to the compiler. There are some counterexamples to this (e.g. compiling for cc 2.0 seems to run faster, etc.), but as a general recommendation it is best.
Use the latest version of CUDA (compiler). This is especially important when using GPU libraries (CUFFT, CUBLAS, etc.) (yes, this is not really a compiler setting)
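As a rough illustration only (the flags and the architecture value are examples, not a universal recipe), a release-style nvcc command line for a cc 2.1 device might look like the first line below, while the second adds the device-debug switches and will typically produce much slower device code:
nvcc -O2 -arch=sm_21 mykernel.cu -o mykernel
nvcc -G -g -arch=sm_21 mykernel.cu -o mykernel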

Related

Advantages of Intel oneAPI versus the older Parallel Studio XE (Fortran user)

I program numerical models to solve partial differential equations in Fortran, in serial and parallel (with MPI). Only Fortran; I do not know/need other languages. I see I now need to migrate to Intel oneAPI; before, I had Parallel Studio XE 2019. Are there any advantages/new features after migrating to oneAPI that an average Fortran user will enjoy? I have never used GPUs; will oneAPI make the transition easier if in the future I want to learn how to parallelize the code and run it on a GPU?
There is no real transition to speak of. You will get the same ifort you had before, just an updated version, and an option to try the new (LLVM-based) ifx, but it is just an option; I have not used it yet either.
The ifort compiler is the same compiler you are used to, just updated to (as of now) version 2021.2, with various small improvements and bug fixes, as always comes with a new version.
If you do want to try the new ifx, it indeed comes with new GPU features. Only the ifx compiler supports GPU offload. See Get Started with OpenMP* Offload to GPU for the Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler (Beta).

What compilers support CUDA

I found a problem with Visual Studio. My project that uses OpenMP multithreading was twice as slow on Visual Studio 2010 as on Dev-C++. Now I have written another project that uses CUDA, and I think it runs slowly because of Visual Studio, so I need some other compiler that supports CUDA. My questions are:
Does Dev-C++ support CUDA?
What compilers other than Visual Studio support CUDA?
If there are many compilers supporting CUDA, which will give the best speed for the application?
The CUDA Toolkit Release Notes list the supported platforms and compilers.
Well, I think it's the other way around. The thing is, there is a compiler driver called nvcc. It generates device code and host code and sends the host code to a compiler. It should be a C compiler and it should be in the executable path. (EDIT: and it should be gcc on Linux and cl on Windows, and I think I can ignore Mac, as the release notes did.)
nvcc Compiler Info reads:
A general purpose C compiler is needed by nvcc in the following situations:
During non-CUDA phases (except the run phase), because these phases will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages. On Linux platforms, the compiler is assumed to be 'gcc', or 'g++' for linking. On Windows platforms, the compiler is assumed to be 'cl'. The compiler executables are expected to be in the current executable search path, unless the option --compiler-bindir is specified, in which case the value of this option must be the name of the directory in which these compiler executables reside.
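For example (the path is only illustrative and depends on your Visual Studio installation), you can point nvcc at cl explicitly:
nvcc --compiler-bindir "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" kernel.cu -o kernel.exe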
And please don't talk like that about compilers. Your code just happens to be written in a way that works better with Dev-C++. What is generated is assembly code either way. I am not saying compilers make no difference, but maybe 4 to 5%, not 100%.
And absolutely do not blame the compiler for your slow program. It is definitely due to inefficient memory access and incorrect use of the different types of memory.
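As a hedged illustration of that last point (the kernels below are made up for the example), both kernels copy the same data, but the first reads global memory in a coalesced pattern while the second uses a strided pattern that typically wastes memory bandwidth and runs far slower, regardless of which host compiler was used:

// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart,
// which breaks coalescing on most GPUs.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;
    if (i < n) out[i] = in[j];
}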

Cannot debug Thrust CUDA in Visual Studio [duplicate]

I am using Visual Studio 2010, Parallel Nsight 2.2 and CUDA 4.2 for learning. My system is Windows 8 Pro x64.
I opened the radix sort project included in the CUDA computing SDK in VS and compiled it with no errors. The sort code uses the Thrust library:
if (keysOnly)
    thrust::sort(d_keys.begin(), d_keys.end());
else
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin());
I want to know how Thrust dispatches the sort function to CUDA kernels, so I tried to add breakpoints in front of the lines above and compiled the project in debug mode. But when I use Parallel Nsight for CUDA debugging, I always get the error "no source correspondence for breakpoint".
So, my problems are:
How do I debug CUDA Thrust programs in Visual Studio with Parallel Nsight?
Or can anyone show me another way to find out how Thrust dispatches functions to CUDA kernels or other functions?
Any advice will be appreciated!
Normally, to debug device code in CUDA, it's necessary to pass the:
-G -g
switches to nvcc. However, this modality is not supported with Thrust code. You can get an idea of how Thrust code gets dispatched to the device by following the structure in the Thrust include files. Since Thrust is entirely templatized code, there are no libraries to worry about. However, that is a challenging proposition. You can also tell the compiler to generate PTX:
-ptx
which is one of the intermediate code types that CUDA code gets compiled to. However, that is not a trivial thing to parse either. This link gives some alternate ideas for debugging with Thrust.
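As a minimal sketch (the file name is hypothetical), you could compile a small Thrust translation unit to PTX and inspect which kernels Thrust instantiates for the sort:

// thrust_sort.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    thrust::device_vector<int> d_keys(1024, 0);
    thrust::sort(d_keys.begin(), d_keys.end());  // Thrust dispatches this to device sort kernels
    return 0;
}

// Generate PTX for inspection with, e.g.:
//   nvcc -ptx thrust_sort.cu -o thrust_sort.ptx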

Why is bytecode JIT compiled at execution time and not at installation time?

Compiling a program to bytecode instead of native code enables a certain level of portability, as long as a fitting virtual machine exists.
But I'm kinda wondering, why delay the compilation? Why not simply compile the byte code when installing an application?
And if that is done, why not adapt it for languages that directly compile to native code? Compile them to an intermediate format, distribute a "JIT" compiler with the installer, and compile it on the target machine.
The only thing I can think of is runtime optimization. That's about the only major thing that can't be done at installation time. Thoughts?
Often it is precompiled. Consider, for example, precompiling .NET code with NGEN.
One reason for not precompiling everything would be extensibility. Consider those languages which allow use of reflection to load additional code at runtime.
Some JIT compilers (Java HotSpot, for example) use type-feedback-based inlining. They track which types are actually used in the program, and inline function calls based on the assumption that what they saw earlier is what they will see later. In order for this to work, they need to run the program through a number of iterations of its "hot loop" in order to know what types are used.
This optimization is totally unavailable at install time.
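A small C++-style sketch of why that is (the types here are invented purely for illustration): an ahead-of-time compiler cannot safely inline the virtual call below because the concrete type behind the pointer is only known at run time, whereas a JIT that observes only one type flowing through the hot loop can speculatively inline it behind a cheap type check.

#include <memory>
#include <vector>

struct Shape  { virtual double area() const = 0; virtual ~Shape() = default; };
struct Circle : Shape { double r = 1.0; double area() const override { return 3.14159 * r * r; } };

// "Hot loop" with a virtual call: only runtime type feedback can tell a JIT
// that every element is actually a Circle, so the call can be inlined.
double total(const std::vector<std::unique_ptr<Shape>>& shapes)
{
    double sum = 0.0;
    for (const auto& s : shapes)
        sum += s->area();
    return sum;
}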
The bytecode has already been compiled, just as the C++ code has been compiled.
Also, the JIT compiler, i.e. the .NET and Java runtimes, is massive and dynamic; and you can't foresee which parts of it an application will use, so you need the entire runtime.
Also one has to realize that a language targeted to a virtual machine has very different design goals than a language targeted to bare metal.
Take C++ vs. Java.
C++ wouldn't work on a VM; in particular, a lot of the C++ language design is geared towards RAII.
Java wouldn't work on bare metal for many reasons; primitive types, for one.
EDIT: As delnan points out correctly, JIT and similar technologies, though hugely beneficial to bytecode performance, would likely not be available at install time. Also, compiling for a VM is very different from compiling to native code.

What compilers besides gcc can vectorize code?

GCC can vectorize loops automatically when certain options are specified and given the right conditions. Are there other compilers widely available that can do the same?
ICC
LLVM can also do it, and Vector Pascal too, and one that is not free: VectorC. These are just some I remember.
Also PGI's compilers.
The Mono project, the open source alternative to Microsoft's .NET, has added objects that use SIMD instructions. While not a compiler, the Mono CLR is the first managed code system to generate vector operations natively.
IBM's xlc can auto-vectorize C and C++ to some extent as well.
Actually, in many cases GCC used to be quite a bit worse than ICC for automatic code vectorization. I don't know if it has recently improved enough, but I doubt it.
VectorC can do this too. You can also specify the target CPU so that it takes advantage of different instruction sets (e.g. MMX, SSE, SSE2, ...).
Visual C++ (I'm using VS2005) can be forced to use SSE instructions. It seems not to be as good as Intel's compiler, but if someone already uses VC++, there's no reason not to turn this option on.
Go to the project's properties, Configuration Properties, C/C++, Code Generation: Enable Enhanced Instruction Set. Set "Streaming SIMD Instructions" or "Streaming SIMD Instructions 2". You will also have to set the floating-point model to fast. Some other options will have to be changed too, but the compiler will tell you about that.
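As a minimal sketch (the function is only an example), this is the kind of loop automatic vectorizers target; with GCC of that era you would compile with something like -O3 -ftree-vectorize -msse2, and in VC++ the Enhanced Instruction Set option described above together with the fast floating-point model enables similar SSE code generation:

// saxpy-style loop: independent iterations over contiguous arrays,
// which is the pattern automatic vectorizers handle best
void saxpy(float a, const float* x, float* y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}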
Even though this is an old thread, I thought I'd add to this list - Visual Studio 11 will also have auto-vectorisation.
