What are the source file which need to be modified in that case with intel compiler? How to use "-fast" or inter procedural optimizations if "-fast" is impossible?
edit: forget this if it is too-specific: I use http://pf.natalenko.name and ultra ksm
Related
I am trying to understand how exactly I can use OpenACC to offload computation to my nvidia GPU on GCC 5.3. The more I google things the more confused I become. All the guides I find, they involve recompiling the entire gcc along with two libs called nvptx-tools and nvptx-newlib. Other sources say that OpenACC is part of GOMP library. Other sources say that the development for OpenACC support will continue only on GCC 6.x. Also I have read that support for OpenACC is in the main brunch of GCC. However if I compile a program with -fopenacc and -foffload=nvptx-non is just wont work. Can someone explain to me what exactly it takes to compiler and run OpenACC code with gcc 5.3+?
Why some guides seem to require (re)compilation of nvptx-tools, nvptx-newlib, and GCC, if, as some internet sources say, OpenACC support is part of GCC's main branch?
What is the role of the GOMP library in all this?
Is it true that development for OpenACC support will only be happening for GCC 6+ from now on?
When OpenACC support matures, is it the goal to enable it in a similar way we enable OpenMP (i.e., by just adding a couple of compiler flags)?
Can someone also provide answers to all the above after replacing "OpenACC" with "OpenMP 4.0 GPU/MIC offload capability"?
Thanks in advance
The link below contains a script that will compile gcc for OpenACC support.
https://github.com/olcf/OLCFHack15/blob/master/GCC5OffloadTest/auto-gcc5-offload-openacc-build-install.sh
OpenACC is part of GCC's main branch now, but there are some points to note. Even if there are libraries that are part of gcc, when you compile gcc, you have to specify which libraries to compile. Not all of them will be compiled by default. For OpenACC there's an additional problem. Since, NVIDIA drivers are not open source, GCC cannot compile OpenACC directly to binaries. It needs to compile OpenACC to the intermediate NVPTX instructions which the Nvidia runtime will handle. Therefore you also need to install nvptx libs.
GOMP library is the intermediate library that handles both OpenMP and OpenACC
Yes, I think OpenACC development will only be happening in GCC 6, but it may still be backported to GCC 5. But your best best would be to use GCC 6.
While I cannot comment on what GCC developers decide to do, I think in the first point I have already stated what the problems are. Unless NVIDIA make their drivers open source, I think an extra step will always be necessary.
I believe right now OpenMP is planned only for CPU's and MIC. I believe OpenMP support for both will probably become default behavior. I am not sure whether OpenMP targeting NVIDIA GPU's are immediately part of their target, but since GCC is using GOMP for both OpenMP and OpenACC, I believe eventually they might be able to do it. Also, GCC is also targeting HSA using OpenMP, so basically AMD APU's. I am not sure whether AMD GPU's will work the same way, but it maybe possible. Since, AMD is making their drivers open source, I believe they maybe easier to integrate into default behavior.
Does GCC support dvec.h, and if not, what can I do to port code written for ICC to work with GCC?
I am getting errors:
fatal error: dvec.h: No such file or directory
#include <dvec.h>
Alternatively, GCC cannot find F32vec8.
See Agner Fog's manual Optimizing software in C++. See section 12.5 Using vector classes.
Agner's Vector Class Library (VCL) is far more powerful than Intel's dvec.h, it works on more compilers (including GCC and Clang), and it's free. However, it requires C++.
Another option is to use Yeppp!. Yepp works for C, C++, C#, Java, and FORTRAN and not just C++. However, it's an actually library that you must link in. The VCL is only a set of header files.
Another difference between the Yeppp! and the VCL is that Yeppp! is built from assembly whereas the VCL uses intrinsics. This is one reason Yeppp! needs to be linked in (MSVC 64-bit mode does not allow inline assembly).
One disadvantage of intrinsics is that the compiler can implement them differently than you expect. This is not normally a problem with ICC and GCC. They are excellent when it comes to intrinsic. However, MSVC with AVX and especially FMA is disappointing (though with SSE it's normally fine). So the performance using the VCL with GCC compared to MSVC may be quite different with AVX and FMA.
With assembly you always get what you want. However, since Yeppp! is not inline assembly you have to deal with the function calling overhead. In my case most of the time I want something like inline assembly which is what intrinsics mostly achieve.
I don't know Yeppp! well but the documentation of the VCL library is excellent and the source code is very clear.
Do you recommend reading your kernel's PTX code to find out to optimize your kernels further?
One example: I read, that one can find out from the PTX code if the automatic loop unrolling worked. If this is not the case, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I find out how to be able to read the PTX code CUDA generates for my kernels?
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled to target machine code either by ptxas at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped with a complete guide to the PTX language with the toolkit, and it is fully documented. The ocelot project has used this documentation to implement their own PTX cross compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
If you want to see what the GPU is actualy running (as opposed to what the compiler is emitting), NVIDIA now supply a binary disassembler tool called cudaobjdump which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficialy tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly at how the compiler is applying optimizations and what instructions it is emitting to implement certain C contructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information contained in both documents to both learn how to compile a CUDA C/C++ kernel code to PTX, and to understand what the PTX instructions will do.
I used gcc to compile a few fortran source files into *.lib and *.dll on Windows platform, using the latest version of mingw . The gcc used is version 3. The result of the output is arpack_win32.dll, blas_win32.dll and lapack_win32.dll.
I then want to compile sssimp.f against the arpack_win32.dll, blas_win32.dll and lapack_win32.dll using Intel visual fortran compiler for Windows, because sssimp.f uses those dlls. But I got the impression ( from Intel support forum) that this is not possible.
Is my impression correct? Or is it that as long as I can produce the underlying libs and dlls ( no matter in which compiler and how old it is), I can use them as my base libs and dlls, and I can link to them from any, modern or old, compiler?
g77 uses a different ABI than IVF, yes. So unless IVF has some g77/f2c compatibility option it's not going to work.
The easiest solution for you is probably to use IVF to compile the libraries too.
As already pointed out, mixing compilers with different calling conventions is likely to be very difficult.
That answer on the Intel Forum pointed out a version of arpack translated to Fortran 90 -- http://people.sc.fsu.edu/~burkardt/f_src/arpack/arpack.html -- can you use that? Also see http://people.sc.fsu.edu/~burkardt/f_src/lapack/lapack.html and http://people.sc.fsu.edu/~burkardt/f_src/blas1_s/blas1_s.html
Or Intel Visual Fortran should be able to compile Fortran 77 using suitable compiler options. What language constructs is it rejecting?
GCC can vectorize loops automatically when certain options are specified and given the right conditions. Are there other compilers widely available that can do the same?
ICC
llvm can also do it and vector pascal too and one that is not free VectorC. These are just some I remember.
Also PGI's compilers.
The Mono project, the Open Source alternative to Microsoft's Silverlight project, has added objects that use SIMD instructions. While not a compiler, the Mono CLR is the first managed code system to generate vector operations natively.
IBM's xlc can auto-vectorize C and C++ to some extent as well.
Actually, in many cases GCC used to be quite worse than ICC for automatic code vectorization, I don't know if it recently improved enough, but I doubt it.
VectorC can do this too. You can also specify all target CPU so that it takes advantage of different instruction sets (e.g. MMX, SIMD, SIMD2,...)
Visual C++ (I'm using VS2005) can be forced to use SSE instructions. It seems not to be as good as Intel's compiler, but if someone already uses VC++, there's no reason not to turn this option on.
Go to project's properties, Configuration properties, C/C++, Code Generation: Enable Enhanced Instruction Set. Set "Streaming SIMD Instructios" or "Streaming SIMD Instructios 2". You will have to set floating point model to fast. Some other options will have to be changed too, but compiler will tell you about that.
Even though this is an old thread, I though I'd add to this list - Visual Studio 11 will also have auto vectorisation.