interprocedural analysis - gcc

Does gcc (C, C++ and Fortran compilers in particular) support interprocedural analysis to improve performance?
If yes, which are the relevant flags?
http://gcc.gnu.org/wiki/InterProcedural says that GCC is going to implement IPA, but that page is quite outdated.

Yes, it does. Take a look at the options starting with -fipa in the GCC optimization options documentation. Recent gfortran versions (4.5+) also support a more sophisticated kind of optimization: link-time optimization (LTO), which performs interprocedural optimizations across files. The corresponding compiler flag is -flto.
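As a rough illustration of what these passes work on, here is a minimal sketch in C. The file names in the build lines and the function names are hypothetical; whether GCC actually specializes or inlines the helper depends on its version and heuristics.

```c
/* ipa_demo.c: a sketch of code that interprocedural analysis can improve.
 * Assumed build lines (file names are hypothetical):
 *   gcc -O2 -c ipa_demo.c            (IPA within this translation unit)
 *   gcc -O2 -flto a.c b.c -o app     (LTO: IPA across translation units)
 */
#include <stddef.h>

/* Looking at this function alone, `stride` is an unknown runtime value. */
static long sum_strided(const int *data, size_t n, size_t stride)
{
    long total = 0;
    for (size_t i = 0; i < n; i += stride)
        total += data[i];
    return total;
}

/* Every call site passes stride == 1, so interprocedural constant
 * propagation (-fipa-cp, enabled at -O2) may specialize or inline
 * sum_strided for that value. */
long sum_all(const int *data, size_t n)
{
    return sum_strided(data, n, 1);
}
```

If sum_strided lived in a different source file, the compiler would only see both sides of the call when building with -flto.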
P.S. I wrote a small series of posts about LTO at my blog. You're welcome! :-)

Related

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support to do vectorization over here, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers for this 2011 question on Quora, which suggests that compilers will do vectorization where possible.
Maybe, compilers will only vectorize very obvious cases such as a std::array<int, 4>, and no more than that, thus C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do better automatic vectorization (perhaps because of language design), such that the C++ standards committee found it necessary to add explicit (C++17-style) vectorization?
The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also very rightly point out using their compiler also allows you to use pragma simd constructs to further show the compiler assumptions that can or can't be made (that are unclear from a purely syntactic level), and hence allow the compiler to further vectorise the code without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (eg multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
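The "multiply all the elements by a scalar" case can be sketched in C as below (the function name is made up for illustration). With GCC, auto-vectorization is enabled at -O3, and -fopt-info-vec reports which loops were vectorized; the exact output varies by compiler version.

```c
#include <stddef.h>

/* Multiply every element by a scalar. `restrict` promises the compiler that
 * `out` and `in` do not overlap, which removes one common obstacle to
 * auto-vectorization. */
void scale(float *restrict out, const float *restrict in, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```

Pasting this into Compiler Explorer with different compilers and flags is a quick way to see whether packed SIMD instructions are actually emitted.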
But for more complicated code, the vectorisation gains that can be had come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities to do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
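As a small sketch of the OpenMP route, assuming GCC or Clang with -fopenmp-simd (or -fopenmp): a floating-point reduction is a typical case where the compiler needs to be told that reordering the additions is acceptable. The function name is illustrative only.

```c
#include <stddef.h>

/* Dot product. Under strict floating-point rules a compiler may not reorder
 * the additions, which blocks vectorization of the reduction. The pragma
 * states that reordering is acceptable here. Without -fopenmp-simd or
 * -fopenmp the pragma is ignored and the loop stays scalar-correct. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Because the summation order may change, results can differ in the last bits from the scalar loop for general inputs.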
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this - put your C++ code in there and look at what different compilers actually generate? Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later versions) but will show you what different versions of ICC, GCC, Clang and others can do with code...

Is there a performance difference between compiling and linking the MKL library via icc or gcc?

I can't find any info about this topic.
Is there a difference in runtime performance when running a program which was compiled and linked with gcc or icc?
(My assumption is that the program runs on Intel architecture)
Both compilers are officially supported by MKL, and they link the same libraries, such as libmkl_core.a or libmkl_core.so, which do the actual work for MKL. The performance of MKL operations should therefore be the same. But of course the performance of the code you write yourself could differ, as it is compiled by different compilers.
Edit
MKL is designed as a C library. Most of the APIs are pre-compiled and designed to run on large input data, so they are expected to have a relatively long running time. The way you call the API won't affect the performance very much.
There are inline code and helper macros, though. For example, mkl_direct_call.h includes inline code/macros for small matrix multiplication; small matrices (size of ~20 or smaller) may get a performance improvement with this code. So you may see a performance difference when this part is involved. Please refer to the following links for more details.
Improve Intel MKL Performance for Small Problems: The Use of MKL_DIRECT_CALL
Limitations of the Direct Call

what is compiler feedback based optimization? is it available with arm gcc compiler?

what is compiler feedback(not linker feedback) based optimization? How to get this feedback file for arm gcc compiler?
Read the chapter of the GCC documentation dedicated to optimizations (and also the section about ARM in GCC: ARM options)
You can use:
link-time optimization (LTO) by compiling and linking with -flto in addition to other optimization flags (so make CC='gcc -flto -O2'): the linking phase also does optimizations (so the compiler is linking files containing not only object code, but also the intermediate GIMPLE internal compiler representation)
profile-guided optimization (PGO, with -fprofile-generate, -fprofile-use, -fauto-profile etc...): you first generate code with profiling instructions, you run some representative benchmarks to get profiling information, and you compile a second time using this profiling information.
You could mix both approaches and give a lot of other optimization flags. Be sure to be consistent with them.
On x86 & x86-64 (and ARM natively) you might also use -mtune=native and there are lots of other -mtune possibilities.
Some people call profile-based optimization compiler feedback optimization (because dynamic runtime profile information is given back into the compiler). I prefer the "profile-guided optimization" term. See also this old question.
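The profile-guided workflow described above can be sketched as follows. This is only an illustration: the file and program names are hypothetical, and a main function that runs a representative workload is assumed but not shown.

```c
/* pgo_demo.c: a sketch of code where profile data can help.
 * Assumed workflow with GCC (names are hypothetical, a driver main is
 * assumed to exist elsewhere):
 *   gcc -O2 -fprofile-generate pgo_demo.c main.c -o demo   (instrumented build)
 *   ./demo                                  (representative run, writes .gcda files)
 *   gcc -O2 -fprofile-use pgo_demo.c main.c -o demo        (optimized rebuild)
 */
#include <stddef.h>

/* Count values outside [lo, hi]. If the training run shows that almost every
 * value is in range, the compiler can treat the in-range path as the hot
 * path when laying out and optimizing the loop. */
size_t count_out_of_range(const int *v, size_t n, int lo, int hi)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < lo || v[i] > hi)
            ++bad;
    }
    return bad;
}
```

For an ARM cross toolchain the instrumented binary has to run on the target (or a simulator), and the resulting profile files must be made available to the second compile.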

arm-none-eabi-gcc: -march option v/s -mcpu option

I have been following J. Lynch's tutorial from Atmel for developing small programs for the at91sam7s256 (microcontroller). I have done a bit of tinkering and used arm-none-eabi instead of arm-elf (the old one). By default I found that gcc compiles assuming -march=armv4t even if one does not mention anything about the chip. How much difference would it make if I use -mcpu=arm7tdmi?
Even after searching a lot on Google I could not find a detailed tutorial which explains all possible command-line options, including separate linker, assembler and objcopy options like -MAP etc.
Can you provide any such material where all possibilities are explained?
Providing information about the specific processor gives the compiler additional information for selecting the most efficient mix of instructions, and the most efficient way of scheduling those instructions. It depends very much on the specific processor how much performance difference explicitly specifying -mcpu makes. There could be no difference whatsoever - the only way to know is to measure.
But in general - if you are building a specific image for a specific device, then you should provide the compiler with as much information as possible.
Note: your current instance of gcc compiles assuming -march=armv4t - this is certainly not a universal guarantee for all arm gcc toolchains.
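One way to check what a given toolchain actually assumes is to look at its predefined macros, for example with arm-none-eabi-gcc -mcpu=arm7tdmi -dM -E - < /dev/null. The C sketch below does the same check from inside the code; it assumes a GCC recent enough to define the ACLE macro __ARM_ARCH, and the function name is made up for illustration.

```c
/* Return the ARM architecture version the compiler is targeting, or 0 when
 * the compiler is not targeting ARM (or does not define the ACLE macro).
 * The value changes with -march / -mcpu, so it shows which default the
 * toolchain picked when neither option is given. */
int target_arm_arch(void)
{
#ifdef __ARM_ARCH
    return __ARM_ARCH;
#else
    return 0;
#endif
}
```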

Compile time comparison between Windows GCC and MSVC compiler

We are working on reducing compile times on Windows and are therefore considering all options. I've tried to look on Google for a comparison between compile time using GCC (MinGW or Cygwin) and MSVC compiler (CL) without any luck. Of course, making a comparison would not be too hard, but I'd rather avoid reinventing the wheel if I can.
Does anyone know of such a comparison out there? Or maybe anyone has some hands-on experience?
Input much appreciated :)
Comparing compilers is not trivial:
It may vary from processor to processor. GCC may better optimize for i7 and MSVC for Core 2 Duo or vice versa. Performance may be affected by cache etc. (Unroll loops or don't unroll loops, that is the question ;) ).
It depends very largely on how code is written. Certain idioms (equivalent to each other) may be preferred by one compiler.
It depends on how the code is used.
It depends on flags. For example gcc -O3 is known to often produce slower code than -O2 or -Os.
It depends on what assumptions can be made about the code. Can you allow strict aliasing or not (-fno-strict-aliasing/-fstrict-aliasing in gcc)? Do you need full IEEE 754 or can you bend the floating-point calculation rules (-ffast-math)?
It also depends on particular processor extensions. Do you enable MMX/SSE or not? Do you use intrinsics or not? Do you require the code to be i386 compatible or not?
Which version of gcc? Which version of msvc?
Do you use any of the gcc/msvc extensions?
Do you use microbenchmarking or macrobenchmarking?
And at the end you find out that the result was less than the statistical error ;)
Even if a single application is used the result may be inconclusive (function A performs better in gcc but B in msvc).
PS. I would say Cygwin will be the slowest as it has an additional level of indirection between POSIX and WinAPI.
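To make the strict-aliasing point concrete, here is a minimal C sketch (the function name is illustrative, and a 32-bit IEEE 754 float is assumed). Code written this way behaves the same with -fstrict-aliasing and -fno-strict-aliasing, so the flag choice stops being a correctness question.

```c
#include <stdint.h>
#include <string.h>

/* Reading a float through a uint32_t pointer, as in *(uint32_t *)&f, breaks
 * the strict aliasing rule and is only safe with -fno-strict-aliasing.
 * Copying the bytes is well defined either way, and optimizing compilers
 * typically reduce the memcpy to a single register move. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```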
