GCC support for Intel AVX intrinsics (dvec.h)

Does GCC support dvec.h, and if not, what can I do to port code written for ICC to work with GCC?
I am getting errors:
    fatal error: dvec.h: No such file or directory
     #include <dvec.h>
Alternatively, GCC cannot find F32vec8.

See Agner Fog's manual Optimizing software in C++, in particular section 12.5, "Using vector classes".
Agner's Vector Class Library (VCL) is far more powerful than Intel's dvec.h; it works with more compilers (including GCC and Clang), and it's free. However, it requires C++.
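To give a feel for a port, here is a minimal sketch of the VCL equivalent of F32vec8 code (assuming VCL's vectorclass.h is on the include path and the code is compiled with -mavx or higher; the function and variable names are placeholders):

    #include "vectorclass.h"  // Agner Fog's VCL; Vec8f is 8 floats (AVX), like F32vec8

    // c[i] = a[i] * b[i] + c[i], 8 floats at a time; n is assumed a multiple of 8.
    void muladd(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 8) {
            Vec8f va, vb, vc;
            va.load(a + i);        // load 8 floats
            vb.load(b + i);
            vc.load(c + i);
            vc = va * vb + vc;     // overloaded operators, much like dvec.h
            vc.store(c + i);       // store 8 floats back
        }
    }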
Another option is to use Yeppp!. Yeppp! works for C, C++, C#, Java, and FORTRAN, not just C++. However, it's an actual library that you must link in, whereas the VCL is only a set of header files.
Another difference between Yeppp! and the VCL is that Yeppp! is built from assembly whereas the VCL uses intrinsics. This is one reason Yeppp! needs to be linked in (MSVC does not allow inline assembly in 64-bit mode).
One disadvantage of intrinsics is that the compiler can implement them differently than you expect. This is not normally a problem with ICC and GCC; they are excellent when it comes to intrinsics. However, MSVC with AVX, and especially FMA, is disappointing (though with SSE it's normally fine). So performance using the VCL with GCC compared to MSVC may be quite different with AVX and FMA.
With assembly you always get what you want. However, since Yeppp! is not inline assembly, you have to deal with function-call overhead. In my case, most of the time I want something like inline assembly, which is what intrinsics mostly achieve.
I don't know Yeppp! well, but the documentation of the VCL is excellent and the source code is very clear.


What does it take to make OpenACC/OpenMP 4.0 offloading to NVIDIA/MIC work on GCC?

I am trying to understand how exactly I can use OpenACC to offload computation to my NVIDIA GPU with GCC 5.3. The more I google, the more confused I become. All the guides I find involve recompiling the entire GCC along with two libs called nvptx-tools and nvptx-newlib. Other sources say that OpenACC is part of the GOMP library. Other sources say that development for OpenACC support will continue only on GCC 6.x. Also, I have read that support for OpenACC is in the main branch of GCC. However, if I compile a program with -fopenacc and -foffload=nvptx-none, it just won't work. Can someone explain to me what exactly it takes to compile and run OpenACC code with GCC 5.3+?
Why some guides seem to require (re)compilation of nvptx-tools, nvptx-newlib, and GCC, if, as some internet sources say, OpenACC support is part of GCC's main branch?
What is the role of the GOMP library in all this?
Is it true that development for OpenACC support will only be happening for GCC 6+ from now on?
When OpenACC support matures, is it the goal to enable it in a similar way we enable OpenMP (i.e., by just adding a couple of compiler flags)?
Can someone also provide answers to all the above after replacing "OpenACC" with "OpenMP 4.0 GPU/MIC offload capability"?
Thanks in advance
The link below contains a script that will build GCC with OpenACC support.
https://github.com/olcf/OLCFHack15/blob/master/GCC5OffloadTest/auto-gcc5-offload-openacc-build-install.sh
OpenACC is part of GCC's main branch now, but there are some points to note. Even for libraries that are part of GCC, when you build GCC you have to specify which of them to build; not all of them are built by default. For OpenACC there's an additional problem: since the NVIDIA drivers are not open source, GCC cannot compile OpenACC directly to binaries. It needs to compile OpenACC to intermediate NVPTX instructions, which the NVIDIA runtime will handle. Therefore you also need to install the nvptx libs.
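Once you have such an offloading-enabled build, compiling is just the two flags mentioned in the question. A minimal sketch (assuming a GCC built with nvptx offloading as above; file and variable names are placeholders):

    // saxpy_acc.cpp -- build with: g++ -fopenacc -foffload=nvptx-none saxpy_acc.cpp
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Offloaded to the GPU when nvptx offloading is available;
        // otherwise GCC falls back to running the loop on the host.
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        return 0;
    }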
The GOMP library (libgomp) is the intermediate runtime library that handles both OpenMP and OpenACC.
Yes, I think OpenACC development will only be happening in GCC 6, though it may still be backported to GCC 5. But your best bet would be to use GCC 6.
While I cannot comment on what the GCC developers decide to do, I think the first point above already states what the problems are. Unless NVIDIA makes its drivers open source, I think an extra step will always be necessary.
I believe right now OpenMP offloading is planned only for CPUs and MIC, and I believe OpenMP support for both will probably become default behavior. I am not sure whether OpenMP targeting NVIDIA GPUs is immediately part of their plans, but since GCC uses GOMP for both OpenMP and OpenACC, I believe they might eventually be able to do it. GCC is also targeting HSA using OpenMP, so basically AMD APUs. I am not sure whether AMD GPUs will work the same way, but it may be possible. Since AMD is making its drivers open source, I believe they may be easier to integrate into default behavior.

Does AT&T syntax work on the Intel platform?

I'm learning assembly.
I know that GCC supports AT&T syntax, but I want my program to run on Intel processors.
So would it work on Intel processors regardless of the syntax, or must it be Intel syntax to work on an Intel platform? I'm confused.
Thanks.
AT&T vs. Intel syntax has been covered many times, here and in other places.
Assembly language is a language defined by the assembler, the particular program used to convert the ASCII assembly language into machine code for the particular target you are interested in. Unlike, say, a C or C++ compiler, where a standard defines the language, you can have 7 assemblers for the same target processor, and there is no reason to assume their assembly languages are compatible in any way, shape, or form. It is the machine code they produce that matters, and if that machine code matches the same target, then use the tool you like best for whatever reason.
For this case there was the Intel format, as defined by the Intel documentation and supported by the Intel assembler, and then supported, sort of, by other assemblers: the instructions were close or the same, they might have had a compatibility mode, and they often had their own directives. For example a86 (or was it as86 or asm86?), tasm, masm, and currently nasm. And then you had this AT&T syntax: someone somewhere (AT&T?) decided to make an assembler with a goofy assembly language that specifically didn't match the Intel documentation at all. That became the Intel vs. AT&T syntax thing. The GNU assembler is well known for messing with existing assembly languages as well, and it uses AT&T with its own nuances thrown in. It does have an Intel-syntax switch you should check.
The question you should be asking is about the target. Assemblers like the GNU assembler for x86 are often capable of generating code for various flavors of x86, so you need to make sure the output matches your computer (it most likely does if you don't add any target-specific options).
There is no reason to assume an AT&T-syntax assembler (the GNU assembler, gas or as) would not work; the syntax only affects the source text, not the machine code that runs on the processor.
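To make the difference concrete, here is a sketch using GCC extended inline assembly; the instruction is written in AT&T syntax, but the bytes emitted are the same as an Intel-syntax assembler would produce for "add eax, 5" (the function name is a placeholder):

    // AT&T syntax: size suffix on the mnemonic, $ for immediates,
    // source before destination. Intel syntax for the same instruction
    // is "add eax, 5" (destination first, no sigils).
    int add_five(int x) {
        asm("addl $5, %0"   // add the immediate 5 to operand %0
            : "+r"(x));     // "+r": x is read and written, held in a register
        return x;
    }
    // GCC/gas can also work in Intel syntax: compile with -masm=intel,
    // or use the .intel_syntax directive in assembly source.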

How should I disable C++0x and/or C++11 on the command line with Intel's Windows compiler?

The build system on my cross-platform project has a command line for Intel's Windows C++ compiler that may or may not include /Qstd=c++0x, as a result of detecting the compiler's feature set. For most of the code base this works well; however, for a small number of CUDA files, I need to disable the more recent dialects of C++ to suit the constraints of the nvcc wrapper compiler.
How should I phrase something like /Qstd=c++98 or /Qnostd=c++0x at the end of the command line so that it overrides any earlier specifications of C++ dialect?
Edit: Having been educated that these flags are actually for the Intel compiler, I have found that appending /Qstd=c++98 is probably the right approach.
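For example, with the dialect flag appended so the later one takes precedence (icl and the file name are placeholder stand-ins for whatever the build system emits):

    icl ...detected options... /Qstd=c++0x ... /Qstd=c++98 cuda_file.cpp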
You can't for MSVC. Each MSVC version implements its own interpretation of something between two or three standards, and you're stuck with it.
The options you quote are for the Intel Compiler (see here). If possible, I'd suggest using the Intel Compiler then.
I do fail to see how disabling the recent dialects in the C++ compiler will please the nvcc wrapper compiler... Just don't write C++11 code, and you'll be fine, right?

gcc __sync builtins and x86

I was looking at a question about atomic compare-and-swap and GCC intrinsics. I noticed that an answer quoted from the GCC manual (note: the answer I looked at quoted an earlier version of GCC, but I've linked to the latest version's manual because I checked that nothing had changed). However, when I looked at the text in the manual, I saw that it appears to reference Itanium rather than x86:
    The following builtins are intended to be compatible with those
    described in the Intel Itanium Processor-specific Application Binary
    Interface, section 7.4. As such, they depart from the normal GCC
    practice of using the "__builtin_" prefix, and further that they are
    overloaded such that they work on multiple types.
My question is: why does GCC reference the Itanium documentation, and does that affect how the intrinsics work on x86? Are there any differences, or is it safe to assume that even though the GCC manual references the Itanium manual, everything the GCC manual describes will work correctly on an x86 system?
My understanding is that a lot of GCC's ABI decisions (from the egcs fork) were based on the ABI specs for the good ship Itanic. This included the name-mangling conventions for C++ symbols. There was a large effort (Project Trillian) to have IA-64 Linux (and GCC) ready to go when the actual processor became available. The semantics are intended to be platform-independent, though they will be replaced by the __atomic builtins.
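As an illustration, the builtins behave the same on x86, where they compile down to lock-prefixed instructions such as lock cmpxchg; a minimal sketch comparing the legacy builtin with its __atomic successor (variable names are placeholders):

    #include <cstdio>

    int main() {
        int v = 1;

        // Legacy __sync builtin (the Itanium-ABI-derived family):
        // atomically: if v == 1 then v = 2; returns the previous value.
        int old = __sync_val_compare_and_swap(&v, 1, 2);

        // The newer __atomic equivalent, with explicit memory ordering:
        int expected = 2;
        __atomic_compare_exchange_n(&v, &expected, 3, /*weak=*/false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

        printf("old=%d v=%d\n", old, v);  // old=1 v=3 on any supported target
        return 0;
    }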

Should I look into PTX to optimize my kernel? If so, how?

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?
One example: I read that one can find out from the PTX code whether the automatic loop unrolling worked. If not, one would have to unroll the loops manually in the kernel code.
Are there other use-cases for the PTX code?
Do you look into your PTX code?
Where can I find out how to read the PTX code CUDA generates for my kernels?
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled to target machine code either by ptxas at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run.
It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that).
CUDA has always shipped with a complete guide to the PTX language with the toolkit, and it is fully documented. The Ocelot project has used this documentation to implement its own PTX cross-compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump, which can show the actual machine-code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.
Having said that, there is a lot to be learned from PTX output, particularly about how the compiler applies optimizations and which instructions it emits to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents to learn how to compile CUDA C/C++ kernel code to PTX and to understand what the PTX instructions do.
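To get PTX to read in the first place, nvcc can emit it directly. A minimal sketch (file and kernel names are placeholders):

    // saxpy.cu -- emit PTX with:  nvcc -ptx saxpy.cu   (writes saxpy.ptx)
    //             or keep all intermediate files:  nvcc -keep saxpy.cu
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];  // in the PTX, expect an fma.rn.f32 here
    }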
