Compile time comparison between Windows GCC and MSVC compiler

We are working on reducing compile times on Windows and are therefore considering all options. I've tried searching on Google for a comparison of compile times using GCC (MinGW or Cygwin) and the MSVC compiler (CL), without any luck. Of course, making the comparison myself would not be too hard, but I'd rather avoid reinventing the wheel if I can.
Does anyone know of such a comparison out there? Or maybe someone has some hands-on experience?
Input much appreciated :)

Comparing compilers is not trivial:
It may vary from processor to processor. GCC may optimize better for an i7 and MSVC for a Core 2 Duo, or vice versa. Performance may be affected by caches etc. (To unroll loops or not to unroll loops, that is the question ;) ).
It depends very largely on how the code is written. Certain (mutually equivalent) idioms may be preferred by one compiler or the other.
It depends on how the code is used.
It depends on flags. For example, gcc -O3 is known to often produce slower code than -O2 or -Os.
It depends on what assumptions can be made about the code. Can you allow strict aliasing or not (-fno-strict-aliasing/-fstrict-aliasing in gcc)? Do you need full IEEE 754, or can you bend the floating-point calculation rules (-ffast-math)? (A small sketch of the aliasing point appears after this answer.)
It also depends on particular processor extensions. Do you enable MMX/SSE or not? Do you use intrinsics or not? Does the code need to stay i386-compatible?
Which version of gcc? Which version of msvc?
Do you use any of the gcc/msvc extensions?
Do you use microbenchmarking or macrobenchmarking?
And in the end you may find that the difference is smaller than the statistical error ;)
Even if a single application is used, the result may be inconclusive (function A performs better with gcc but function B with MSVC).
PS. I would say Cygwin will be the slowest, as it has an additional level of indirection between POSIX and WinAPI.
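To make the aliasing point concrete, here is a minimal sketch (the function name and numbers are mine, not from any benchmark): under -fstrict-aliasing the compiler may assume that an int* and a float* never overlap and keep *f in a register across the loop, while -fno-strict-aliasing forces it to reload, so the same source can compile to measurably different code depending on this flag alone.

    float scale_and_count(float *f, int *i) {
        *i = 0;
        for (int k = 0; k < 100; ++k) {
            *f *= 2.0f;   /* with strict aliasing, *f can stay in a register... */
            *i += 1;      /* ...because this store is assumed not to touch it   */
        }
        return *f;
    }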

Related

Does GCC's ffast-math have consistency guarantees across platforms or compiler versions?

I want to write cross-platform C/C++ which has reproducible behaviour across different environments.
I understand that gcc's ffast-math enables various floating-point approximations. This is fine, but I need two separately-compiled binaries to produce the same results.
Say I use gcc always, but variously for Windows, Linux, or whatever, and different compiler versions.
Is there any guarantee that these compilations will yield the same set of floating-point approximations for the same source code?
No, it's not that they allow specific approximations; it's that -ffast-math allows the compiler to assume that FP math is associative when it isn't, i.e. to ignore rounding error when transforming code to allow more efficient asm.
Any minor differences in choice of order of operations can affect the result by introducing different rounding.
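A minimal illustration of that non-associativity (values chosen by me for the example):

    #include <cstdio>

    int main() {
        // Without -ffast-math the compiler must keep the grouping as written;
        // with it, either grouping is fair game, so results can differ per build.
        float a = 1e8f, b = -1e8f, c = 1.0f;
        std::printf("%g\n", (a + b) + c);   // prints 1
        std::printf("%g\n", a + (b + c));   // prints 0: c is lost in (b + c)'s rounding
    }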
Older compiler versions might choose to implement sqrt(x) as x * approx_rsqrt(x) with a Newton-Raphson iteration under -ffast-math, because older CPUs had a slower sqrtps instruction, so it was more often worth replacing it with an approximation of the reciprocal sqrt plus 3 or 4 more multiply and add instructions. That is generally no longer worthwhile in most code on recent CPUs, so even if you use the same tuning options (especially the default -mtune=generic instead of, say, -mtune=haswell), the choices that option makes can change between GCC versions.
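A hand-written sketch of that kind of substitution, using SSE intrinsics (my illustration of the idea, not GCC's actual output): rsqrtps gives roughly 12 bits of precision, and one Newton-Raphson step r' = r * (1.5 - 0.5 * x * r * r) refines it; sqrt(x) is then formed as x * r'. Each GCC version is free to tune this sequence differently, which is exactly why results can change.

    #include <xmmintrin.h>

    /* Approximate 1/sqrt(x) for 4 packed floats: coarse rsqrtps estimate
       refined by one Newton-Raphson step. */
    __m128 rsqrt_nr(__m128 x) {
        __m128 r = _mm_rsqrt_ps(x);                        /* ~12-bit estimate */
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f),
                          _mm_mul_ps(half_x, _mm_mul_ps(r, r))));
    }
    /* sqrt(x) is then approximated as x * rsqrt_nr(x). */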
It's hard enough to get deterministic FP without -ffast-math; different libraries on different OSes have different implementations of functions like sin and log (which unlike the basic ops + - * / sqrt are not required to return a "correctly rounded" result, i.e. max error 0.5ulp).
And extra precision for temporaries (FLT_EVAL_METHOD) can change the results if you compile for 32-bit x86 with x87 FP math. (-mfpmath=387 is the default for -m32). If you want to have any hope here, you'll want to avoid 32-bit x86. Or if you're stuck with it, maybe you can get away with -msse2 -mfpmath=sse...
You mentioned Windows, so I'm assuming you're only talking about x86 GNU/Linux, even though Linux runs on many other ISAs.
But even just within x86, compiling with -march=haswell enables use of FMA instructions, and GCC defaults to #pragma STDC FP_CONTRACT ON (even across C statements, beyond what the usual ISO C rules allow.) So actually even without -ffast-math, FMA availability can remove rounding for the x*y temporary in x*y + z.
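A small demonstration of the contraction effect (inputs picked by me so the difference is visible): std::fma computes x*y + z with a single rounding, which is what a contracted fma instruction does, while the uncontracted expression rounds the x*y temporary first.

    #include <cmath>
    #include <cstdio>

    int main() {
        double eps = std::ldexp(1.0, -27);          // 2^-27
        double x = 1.0 + eps, y = 1.0 - eps, z = -1.0;
        // Whether this is contracted to an fma depends on the target and on
        // FP_CONTRACT; contracted and uncontracted builds print different values.
        std::printf("%a\n", x * y + z);
        // Always the single-rounding (fused) result: -0x1p-54 here.
        std::printf("%a\n", std::fma(x, y, z));
    }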
With -ffast-math:
One version of gcc might decide to unroll a loop by 2 (and use 2 separate accumulators) when summing an array, while an older version of gcc with the same options might still sum in order.
(Actually, current gcc is terrible at this: when it does unroll (not by default), it often still uses the same (vector) accumulator, so it doesn't hide FP latency the way clang does. e.g. https://godbolt.org/z/X6DTxK uses different registers for the same variable, but it's still just one accumulator, with no vertical addition after the sum loop. But hopefully future gcc versions will be better. And differences between gcc versions in how they do a horizontal sum of a YMM or XMM register could introduce differences there when auto-vectorizing.)
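For reference, here is roughly what the multiple-accumulator rewrite looks like if you write it by hand (the name and the unroll factor of 2 are just for illustration); -ffast-math permits the compiler to do this behind your back, which changes the rounding order and hence the result:

    /* n is assumed even here; a real version needs a scalar tail loop. */
    float sum_two_accumulators(const float *a, int n) {
        float s0 = 0.0f, s1 = 0.0f;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];        /* the two running sums round independently... */
            s1 += a[i + 1];
        }
        return s0 + s1;        /* ...and the final combine adds one more rounding */
    }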

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
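For reference, a minimal sketch of such a call (assuming a standard library that actually implements the parallel algorithms):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<int> v(1000, 0);
        // The execution policy permits, but does not require, the library to
        // use multiple threads and/or vector instructions for the sort.
        std::sort(std::execution::par_unseq, v.begin(), v.end());
    }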
I noticed that Microsoft's experimental implementation mentions (over here) that the VC++ compiler lacks support for vectorization, which surprises me: I thought modern C++ compilers were able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even when explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers to this 2011 question on Quora, which suggest that compilers will vectorize where possible.
Maybe compilers will only vectorize very obvious cases, such as a std::array<int, 4>, and no more than that, which would make C++17's explicit parallelization useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do better automatic vectorization (perhaps due to language design), such that the C++ standards committee decided explicit (C++17-style) vectorization was necessary?
In my experience, the best compiler for automatically spotting SIMD-style vectorisation (when told it can generate opcodes for the appropriate instruction sets, of course) is the Intel compiler (which can also generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, with MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel, and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also quite rightly point out that their compiler lets you use #pragma simd constructs to tell the compiler which assumptions can or can't be made (ones that are unclear at a purely syntactic level), and hence lets it vectorise the code further without resorting to intrinsics.
This, I think, points at the problem with hoping that the compiler (for C++ or another language) will do all the vectorisation work: if you have simple vector-processing loops (e.g. multiply all the elements in a vector by a scalar, as in the sketch below) then yes, you could expect 3 of the 4 compilers to spot that.
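For example, a loop like the following (my own trivial example) is the kind of thing GCC, Clang and ICC will typically auto-vectorise at -O2/-O3 when SSE/AVX code generation is enabled, because every iteration is independent:

    void scale(float *out, const float *in, float k, int n) {
        /* Independent iterations, unit stride, no cross-iteration dependency:
           the compiler can process several elements per SIMD instruction
           (possibly after a runtime check that out and in don't overlap). */
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }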
But for more complicated code, the vectorisation gains come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that is going to be hard if not impossible for a compiler to do entirely on its own. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code so that the compiler can see the opportunity to do so, perhaps with #pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this: put your C++ code in there and look at what the different compilers actually generate? Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later) but it will show you what different versions of ICC, GCC, Clang and others do with your code...

Testing FPU on arm processor

I am using a Wandboard-Quad, which contains an i.MX6 ARM processor. This processor has an FPU that I would like to utilize. Before I do, I want to test how much improvement I will get. I have a benchmark algorithm and have tried it with no optimization and with -mfpu=vfp, and there appears to be no improvement; I do get an improvement with -O3.
I am using the arm-linux-gnueabi libraries. Any thoughts on what is incorrect, and how I can tell whether I am using the FPU?
Thanks,
Adam
Look at the assembler output with the -S flag and see whether any FPU instructions are being generated. That's probably the easiest check.
Beyond that, there is a chance that your algorithm uses floating point so rarely that any gain is masked by loading and unloading the FPU registers. In that case, the -O3 optimizations in the other parts of the code would show gains separate from the FPU usage.
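A minimal test along those lines (file and function names are just placeholders): compile it with something like arm-linux-gnueabi-gcc -O2 -mfpu=vfp -mfloat-abi=softfp -S fputest.c and look in the generated .s file for VFP instructions such as fmuls/fadds (vmul.f32/vadd.f32 in unified syntax); if you only see calls to helpers like __aeabi_fmul, the soft-float library is still being used.

    /* fputest.c: small FP kernel whose assembly makes FPU use obvious. */
    float dot(const float *a, const float *b, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc += a[i] * b[i];   /* should become VFP multiply/add instructions */
        return acc;
    }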
The -mfpu option only takes effect when GCC is performing vectorization. Vectorization itself requires a reasonable optimization level (the minimum is -O2 with -ftree-vectorize turned on). So try -O3 -ftree-vectorize -mfpu=vfp to utilize the FPU, and measure the difference against plain -O3.
Also see ARM GCC docs for cases where -funsafe-math-optimizations may be required.
Without any optimisation the output from GCC is so inefficient that you might actually not be able to measure the difference between software and hardware floating point.
To see the benefits that the FPU adds, you need to test with a consistent optimisation level, and then use either -msoft-float or -mhard-float.
This will force the compiler to link against different libraries and make function calls for the floating-point operations rather than using native instructions. It is still possible that the underlying library uses hardware floating point, but I wouldn't worry about that too much.
You can select different sets of FP instructions using -mfpu=. For i.MX6 I think you want -mfpu=neon, as that should enable all applicable floating-point instructions (not just the NEON ones).

gcc, simd intrinsics and fast-math concepts

Hi all :)
I'm trying to get the hang of a few concepts regarding floating point, SIMD/math intrinsics, and the fast-math flag for gcc. More specifically, I'm using MinGW with gcc v4.5.0 on an x86 CPU.
I've searched around for a while now, and that's what I (think I) understand at the moment:
When I compile with no flags, any FP code will be standard x87, with no SIMD intrinsics, and the math.h functions will be linked from msvcrt.dll.
When I use -mfpmath, -msse and/or -march so that MMX/SSE/AVX code gets enabled, gcc actually uses SIMD instructions only if I also specify some optimization flags, like -On or -ftree-vectorize. In that case the intrinsics are chosen automagically by gcc, and some math functions (I'm still talking about the standard math functions in math.h) become intrinsics or inlined code, while others still come from msvcrt.dll.
If I don't specify optimization flags, does any of this change?
When I use specific SIMD data types (those available as gcc extensions, like v4si or v8qi), I have the option to call intrinsic functions directly, or again leave the automagic decision to gcc. gcc can still choose standard x87 code if I don't enable SIMD instructions via the proper flags.
Again, if I don't specify optimization flags, does any of this change?
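As a concrete illustration of the vector-type point above (a minimal sketch, assuming GCC's vector extensions as I understand them):

    /* 4 x 32-bit ints in one 16-byte vector (GCC vector extension). */
    typedef int v4si __attribute__((vector_size(16)));

    v4si add4(v4si a, v4si b) {
        /* With -msse2 this compiles to a single paddd; without SIMD flags
           GCC falls back to four scalar additions. */
        return a + b;
    }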
Please correct me if any of my statements is wrong :p
Now the questions:
Do I ever have to include x86intrin.h to use intrinsics?
Do I ever have to link the libm?
What fast-math has to do with anything? I understand it relaxes the IEEE standard, but, specifically, how? Other standard functions are used? Some other lib is linked? Or are just a couple of flags set somewhere and the standard lib behaves differently?
Thanks to anybody who is going to help :D
Ok, I'm answering for anyone who is struggling a bit to grasp these concepts, like me.
Optimizations with -Ox work on any kind of code, FPU or SSE.
fast-math seems to work only on x87 code. Also, it doesn't seem to change the FPU control word o_O
Builtins are always included. This behavior can be avoided for some builtins with certain flags, such as strict standard-conformance modes or -fno-builtin.
libm.a is used for some stuff that is not included in glibc, but with MinGW it's just a dummy file, so at the moment it's useless to link against it.
Using gcc's special vector types seems useful only when calling the intrinsics directly; otherwise the code gets vectorized anyway.
Any correction is welcomed :)
Useful links:
fpu / sse control
gcc math
and the gcc manual on "Vector Extensions", "X86 Built-in functions" and "Other Builtins"

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, horizontal-add-packed-double, in SSE3). These require a certain register layout that needs to be either deliberately set up or to arise from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly meant for hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, they will use the scalar SSE instructions for regular floating-point operations, but generally don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general, though, you'll have to use them yourself, either in raw assembler or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they allow the compiler to understand the code better, and so to schedule and optimize, but in practice I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
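As a minimal illustration of the intrinsics route (my own sketch, not from any particular codebase; it assumes n is a multiple of 4 and 16-byte-aligned pointers):

    #include <xmmintrin.h>   /* SSE intrinsics */

    void add_arrays(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);            /* aligned 4-float load */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(dst + i, _mm_add_ps(va, vb)); /* 4 adds in one instruction */
        }
    }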
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gained the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage; basically that is anything of length less than 64 bits. (For 64-bit reals it is actually slower to use SIMD.)
The latest versions of the compiler will also automatically parallelise across cores.
