Which method should I prefer to write SIMD instructions?
The _mm_* intrinsics from the *mmintrin.h headers seem to be more portable across compilers.
But the GCC vector extensions seem to produce much simpler code and to support more architectures.
So which method is the best?
If you use the gcc vector extensions you will only be able to use a limited subset of SSE functionality, since there are many SSE intrinsics which do not fit in with a generic vector model such as gcc's. If you only want to do fairly basic stuff, e.g. floating point arithmetic on vectors, then you might get away with it, but if you are interested in exploiting SIMD for maximum performance benefit then you'll need to go with the native intrinsics.
The intrinsics from the *mmintrin.h headers are available only on machines with SSE, but they work across different compilers. The GCC vector extensions are more limited but implemented on a wider range of platforms, and they are obviously GCC-specific.
As with everything, there is no 'best' answer; you'll have to choose one that fits your needs.
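As a rough illustration of the two styles, the sketch below adds four floats once with the GCC vector extension and once with SSE intrinsics. The function names add4_vecext and add4_sse are made up for this example, and the intrinsics variant is only compiled when the compiler defines __SSE__.

```c
#include <string.h>

#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* GCC vector extension: a generic 16-byte vector holding four floats. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3. */
void add4_vecext(const float *a, const float *b, float *out)
{
    v4sf va, vb, vr;
    memcpy(&va, a, sizeof va);   /* load without assuming alignment */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* ordinary operator; the compiler picks the instructions */
    memcpy(out, &vr, sizeof vr);
}

#ifdef __SSE__
/* Same operation written with SSE intrinsics (x86 only). */
void add4_sse(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* add and store */
}
#endif
```

The vector-extension version compiles for any target GCC supports, while the intrinsics version gives access to every SSE instruction but only builds for x86.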
I have an ARM based platform with a Linux OS. Even though its gcc-based toolchain supports both hardfp and softfp, the vendor recommends using softfp and the platform is shipped with a set of standard and platform-related libraries which have only softfp version.
I'm writing computation-intensive (NEON) AI code based on OpenCV and TensorFlow Lite. Following the vendor guide, I have built these with the softfp option. However, I have a feeling that my code underperforms compared to similar hardfp platforms.
Does code performance depend on the softfp/hardfp setting? Do I understand correctly that all .o and .a files the compiler produces for my program also use the softfp convention, which is less efficient? If so, are there any tricks for using the hardfp calling convention internally but softfp for external libraries?
Normally, all objects that are linked together need to have the same float ABI. So if you need to use this softfp-only library, I'm afraid you have to compile your own software as softfp too.
I had the same question about mixing ABIs. See here
Regarding performance: what you lose with softfp compared to hardfp is that floating-point function parameters are passed through the general-purpose registers instead of the FPU registers. This requires some additional copies between registers. As old_timer said, it is impossible to quantify the loss in general. If you have a single huge function with many float operations, the performance will be the same. If you have many small function calls with many floating-point arguments and few operations, the performance will be dramatically worse.
The softfp option only affects the parameter passing.
In other words, unless you are passing lots of float type arguments while calling functions, there won't be any measurable performance hit compared to hardfp.
And since well-designed projects rely heavily on passing pointers to structures instead of many individual values, I would stick to softfp.
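To make the parameter-passing point concrete, here is a small sketch of the two call shapes. The vec3 type and both function names are invented for illustration, and the build line in the comment is only an assumption about how a 32-bit ARM GCC toolchain would be invoked.

```c
/* Possible build line for a 32-bit ARM GCC cross compiler (placeholder name):
 *   <cross-gcc> -O2 -mfpu=neon -mfloat-abi=softfp -c dot.c
 */

typedef struct { float x, y, z; } vec3;   /* hypothetical type */

/* Six float arguments: under softfp they arrive in core registers or on
 * the stack and must be moved into FPU registers before any arithmetic. */
float dot3_by_value(float ax, float ay, float az,
                    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}

/* Two pointer arguments: these are passed identically under softfp and
 * hardfp (only the float return value is still handled differently), and
 * the floats are loaded from memory straight into FPU registers. */
float dot3_by_pointer(const vec3 *a, const vec3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}
```

Both functions compute the same result; the difference under softfp is only in how many values have to cross between core and FPU registers at each call, and it disappears if the compiler inlines the call.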
I want to write cross-platform C/C++ which has reproducible behaviour across different environments.
I understand that gcc's -ffast-math enables various floating-point approximations. This is fine, but I need two separately compiled binaries to produce the same results.
Say I use gcc always, but variously for Windows, Linux, or whatever, and different compiler versions.
Is there any guarantee that these compilations will yield the same set of floating-point approximations for the same source code?
No. It's not that they allow specific approximations; it's that -ffast-math allows compilers to assume that FP math is associative when it's not, i.e. to ignore rounding error when transforming code to allow more efficient asm.
Any minor differences in choice of order of operations can affect the result by introducing different rounding.
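A tiny sketch of why the order matters, using two made-up helper functions. It assumes the code is compiled without -ffast-math, so the compiler keeps the parentheses as written.

```c
/* Evaluates (a + b) + c exactly as written. */
float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c) exactly as written. */
float sum_right(float a, float b, float c)
{
    return a + (b + c);
}

/* With a = 1e20f, b = -1e20f, c = 1.0f:
 *   sum_left  gives 1.0f  (a + b is exactly 0, then 0 + 1)
 *   sum_right gives 0.0f  (b + c rounds back to -1e20f, so the 1 is lost)
 * Under -ffast-math the compiler is free to turn one form into the other. */
```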
Older compiler versions might choose to implement sqrt(x) as x * approx_rsqrt(x) with a Newton-Raphson iteration for -ffast-math, because older CPUs had a slower sqrtps instruction so it was more often worth it to replace it with an approximation of the reciprocal-sqrt + 3 or 4 more multiply and add instructions. This is generally not the case in most code for recent CPUs, so even if you use the same tuning options (especially the default -mtune=generic instead of -mtune=haswell for example), the choices that option makes can change between GCC versions.
It's hard enough to get deterministic FP without -ffast-math; different libraries on different OSes have different implementations of functions like sin and log (which unlike the basic ops + - * / sqrt are not required to return a "correctly rounded" result, i.e. max error 0.5ulp).
And extra precision for temporaries (FLT_EVAL_METHOD) can change the results if you compile for 32-bit x86 with x87 FP math. (-mfpmath=387 is the default for -m32). If you want to have any hope here, you'll want to avoid 32-bit x86. Or if you're stuck with it, maybe you can get away with -msse2 -mfpmath=sse...
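If it helps, the evaluation mode a given build uses can be read from the standard FLT_EVAL_METHOD macro in <float.h>. The wrapper function below is a hypothetical helper, only there so the value can be logged or compared between builds.

```c
#include <float.h>

/* Returns the FLT_EVAL_METHOD this translation unit was compiled with:
 *   0  operations are evaluated in the range and precision of their type
 *   1  float and double operations are evaluated as double
 *   2  all operations are evaluated as long double (typical for x87 math)
 *  -1  indeterminable
 * Other negative values are implementation-defined. */
int fp_eval_method(void)
{
    return FLT_EVAL_METHOD;
}
```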
You mentioned Windows, so I'm assuming you're only talking about x86 GNU/Linux, even though Linux runs on many other ISAs.
But even just within x86, compiling with -march=haswell enables use of FMA instructions, and GCC defaults to #pragma STDC FP_CONTRACT ON (even across C statements, beyond what the usual ISO C rules allow.) So actually even without -ffast-math, FMA availability can remove rounding for the x*y temporary in x*y + z.
With -ffast-math:
One version of gcc might decide to unroll a loop by 2 (and use 2 separate accumulators) when summing an array, while an older version of gcc with the same options might still sum in order.
(Actually current gcc is terrible at this, when it does unroll (not by default) it often still uses the same (vector) accumulator so it doesn't hide FP latency the way clang does. e.g. https://godbolt.org/z/X6DTxK uses different registers for the same variable, but it's still just one accumulator, no vertical addition after the sum loop. But hopefully future gcc versions will be better. And differences between gcc versions in how they do a horizontal sum of a YMM or XMM register could introduce differences there when auto-vectorizing)
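The effect of such an unroll can be reproduced by hand. The sketch below writes both summation orders out explicitly (the function names are invented here) so the difference is visible even without -ffast-math.

```c
#include <stddef.h>

/* Strict left-to-right summation, as the source loop is written. */
float sum_in_order(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* What an unroll-by-2 with two accumulators computes: even and odd
 * elements are summed separately and combined at the end. */
float sum_two_accumulators(const float *x, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i];
        acc1 += x[i + 1];
    }
    if (i < n)
        acc0 += x[i];   /* leftover element when n is odd */
    return acc0 + acc1;
}
```

For the input {1e20f, 1.0f, -1e20f, 1.0f}, sum_in_order returns 1.0f while sum_two_accumulators returns 2.0f, because the first 1.0f is absorbed by 1e20f in one ordering and not in the other.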
C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support to do vectorization over here, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers for this 2011 question on Quora, which suggests that compilers will do vectorization where possible.
Maybe, compilers will only vectorize very obvious cases such as a std::array<int, 4>, and no more than that, thus C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: Do compilers for other languages do better automatic vectorization (maybe due to language design) (so that the C++ standards committee decides it necessary for explicit (C++17-style) vectorization)?
The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also very rightly point out using their compiler also allows you to use pragma simd constructs to further show the compiler assumptions that can or can't be made (that are unclear from a purely syntactic level), and hence allow the compiler to further vectorise the code without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (eg multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
But for more complicated code, the vectorisation gains that can be had come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities to do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this? Put your C++ code in there and look at what different compilers actually generate. Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later versions) but will show you what different versions of ICC, GCC, Clang and others can do with code...
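For reference, the "multiply all the elements by a scalar" case from the answer might look like the sketch below, which is also a convenient thing to paste into Compiler Explorer. The function name scale is made up, and the OpenMP pragma is just one possible way of giving the compiler the extra hint.

```c
#include <stddef.h>

/* Multiply every element of src by k and store the result in dst.
 * `restrict` promises that dst and src do not overlap, which is one of
 * the assumptions the compiler cannot derive from the syntax alone. */
void scale(float *restrict dst, const float *restrict src, float k, size_t n)
{
    /* OpenMP SIMD hint; it has no effect unless OpenMP SIMD support is
     * enabled (for example -fopenmp-simd in GCC or Clang). */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Comparing the generated assembly at different optimisation levels and for different compilers shows quickly whether packed instructions are emitted for this loop.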
We are working on reducing compile times on Windows and are therefore considering all options. I've tried to look on Google for a comparison between compile times using GCC (MinGW or Cygwin) and the MSVC compiler (CL) without any luck. Of course, making a comparison would not be too hard, but I'd rather avoid reinventing the wheel if I can.
Does anyone know of such a comparison out there? Or maybe anyone has some hands-on experience?
Input much appreciated :)
Comparing compilers is not trivial:
It may vary from processor to processor. GCC may better optimize for i7 and MSVC for Core 2 Duo or vice versa. Performance may be affected by cache etc. (Unroll loops or don't unroll loops, that is the question ;) ).
It depends very largely on how code is written. Certain idioms (equivalent to each other) may be preferred by one compiler.
It depends on how the code is used.
It depends on flags. For example, gcc -O3 is known to often produce slower code than -O2 or -Os.
It depends on what assumptions can be made about the code. Can you allow strict aliasing or not (-fno-strict-aliasing/-fstrict-aliasing in gcc)? Do you need full IEEE 754, or can you bend the floating-point calculation rules (-ffast-math)?
It also depends on particular processor extensions. Do you enable MMX/SSE or not? Do you use intrinsics or not? Do you require the code to be i386 compatible or not?
Which version of gcc? Which version of msvc?
Do you use any of the gcc/msvc extensions?
Do you use microbenchmarking or macrobenchmarking?
And at the end you find out that the difference was smaller than the statistical error ;)
Even if a single application is used, the result may be inconclusive (function A performs better in gcc but B in msvc).
PS. I would say Cygwin will be the slowest, as it has an additional level of indirection between POSIX and WinAPI.
I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, it will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
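One way to check whether GCC actually vectorized a loop is to ask for a vectorization report with -fopt-info-vec. The file name vec.c and the function add_arrays below are placeholders, and the command lines in the comment are only one possible invocation.

```c
/* Hypothetical file vec.c. Possible invocations:
 *   gcc -O3 -fopt-info-vec -c vec.c
 *   gcc -O2 -ftree-vectorize -fopt-info-vec -c vec.c
 * GCC then prints a note for each loop it managed to vectorize. */
#include <stddef.h>

/* Element-wise sum of two int arrays; restrict rules out overlap. */
void add_arrays(int *restrict out, const int *restrict a,
                const int *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```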
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits (for 64-bit reals it is actually slower to do SIMD).
The latest versions of the compiler will also automatically parallelise across cores.