Do the -march=corei7-avx -mtune=corei7-avx or -march=corei7 -mtune=corei7 -mavx command-line options to MinGW, together with -mfpmath=sse (or even -mfpmath=both), enable the use of AVX instructions for math routines? Note that --with-fpmath=avx from here does not work (it is an "unrecognized option" for recent MinGW builds).
AVX is enabled by either -march=corei7-avx or -mavx. The -mtune option is neither necessary nor sufficient to enable AVX.
-mfpmath=avx does not make any sense, because this switch controls the generation of scalar floating-point code. It makes no difference whether you use only one float of a 4-float vector register or only one element of an 8-float vector register. If AVX is enabled (via -mavx or -march=corei7-avx), scalar floating-point instructions will use the VEX encoding anyway, which saves a few mov instructions.
Note that on x86_64, -mfpmath defaults to sse, so using this switch is usually unnecessary, and can even be harmful if you don't know exactly what you are doing.
Related
I was benchmarking some counting in a loop code.
g++ was used with -O2, and I noticed that it has some performance problems when some condition is true in 50% of the cases. I assumed that may mean the code does unnecessary jumps (since clang produces faster code, it is not some fundamental limitation).
What I find in this asm output funny is that code jumps over one simple add.
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
0x42b46e <benchmark_many_ints()+1662>: mov %rax,%rcx
0x42b471 <benchmark_many_ints()+1665>: imul %r9,%rax
0x42b475 <benchmark_many_ints()+1669>: shr $0xe,%rax
0x42b479 <benchmark_many_ints()+1673>: and $0x1ff,%eax
0x42b47e <benchmark_many_ints()+1678>: cmp (%r10,%rax,4),%ecx
0x42b482 <benchmark_many_ints()+1682>: jne 0x42b488 <benchmark_many_ints()+1688>
0x42b484 <benchmark_many_ints()+1684>: add $0x1,%rbx
0x42b488 <benchmark_many_ints()+1688>: add $0x4,%rdx
0x42b48c <benchmark_many_ints()+1692>: cmp %rdx,%r8
0x42b48f <benchmark_many_ints()+1695>: jne 0x42b46b <benchmark_many_ints()+1659>
Note that my question is not how to fix my code; I am just asking whether there is a reason why a good compiler at -O2 would generate a jne instruction to jump over one cheap instruction.
I ask because, from what I understand, one could "simply" get the comparison result and use it to increment the counter (rbx in my example) by 0 or 1 without any jumps.
edit: source:
https://godbolt.org/z/v0Iiv4
The relevant part of the source (from a Godbolt link in a comment which you should really edit into your question) is:
const auto cnt = std::count_if(lookups.begin(), lookups.end(),[](const auto& val){
return buckets[hash_val(val)%16] == val;});
I didn't check the libstdc++ headers to see if count_if is implemented with an if() { count++; }, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless cmovcc or setcc.)
It looks like gcc overestimated the cost of branchless for this code with generic tuning. -mtune=skylake (implied by -march=skylake) gives us branchless code for this regardless of -O2 vs. -O3, or -fno-tree-vectorize vs. -ftree-vectorize. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a vector<int>&, so we don't have to wade through the timing and cout code-gen in main.)
branchy code: gcc8.2 -O2 or -O3, and O2/3 -march=haswell or broadwell
branchless code: gcc8.2 -O2/3 -march=skylake.
That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper cmov. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation). It doesn't yet know what x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where cmov is 2 uops? But I tested -march=broadwell and still got branchy code. Hopefully we can rule that out assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop cmov, adc, and sbb (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc is auto-vectorizing (with vpgatherqd xmm) even with -march=haswell, where it doesn't look like a win because gather is expensive, and it requires 32x64 => 64-bit SIMD multiplies using 2x vpmuludq per input vector. Maybe worth it with SKL, but I doubt HSW. Also probably a missed optimization not to pack back down to dword elements to gather twice as many elements with nearly the same throughput for vpgatherdd.
I did rule out the function being less optimized because it was called main (and marked cold). It's generally recommended not to put your microbenchmarks in main: compilers at least used to optimize main differently (e.g. for code-size instead of just speed).
Clang does make it branchless even with just -O2.
When compilers have to decide between branchy and branchless, they have heuristics that guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an int, finding exactly the value you're looking for is rare. The == may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the -O3 branchless code-gen was slower.
-O3 at least used to be more aggressive at if-conversion of conditionals into branchless sequences like cmp ; lea 1(%rbx), %rcx; cmove %rcx, %rbx, or in this case more likely xor-zero / cmp/ sete / add. (Actually gcc -march=skylake uses sete / movzx, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile Guided Optimization shines. Compile with -fprofile-generate, run it, then compile with -fprofile-use, and you'll probably get branchless code.
BTW, -O3 is generally recommended these days. Is optimisation level -O3 dangerous in g++?. It does not enable -funroll-loops by default, so it only bloats code when it auto-vectorizes (especially with very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)
What do the compilation options mean?
export FFLAGS = -O3 -r8 -i4 -I${PWD}/headers -nofor_main.
What does -r8 mean? What does -i4 mean? Where could I find the help file? Can anybody explain the compilation options in FFLAGS? I really appreciate it.
You apparently already know that FFLAGS is a list of options for a FORTRAN compiler.
-r8 sets the size of certain data types to 8 bytes, depending on architecture. It is approximately the same as setting double precision.
-i4 sets the default integer size to 4 bytes.
Do you need more?
EDIT:
There are a lot of different compilers, and versions of compilers. The default for GNU Make is f77, and from the UNIX man page:
-r8
    Double the size of default REAL, DOUBLE, INTEGER, and COMPLEX data.

    NOTE: This option is now considered obsolete and may be removed in
    future releases. Use the more flexible -xtypemap option instead.

    This option sets the default size for REAL, INTEGER, and LOGICAL
    to 8, and for COMPLEX to 16. For INTEGER and LOGICAL the compiler
    allocates 8 bytes, but does 4-byte arithmetic. For actual 8-byte
    arithmetic, see -dbl.
I have a problem compiling C code for an MPC5643L PowerPC board. The code has a long long x variable and gcc assembles it as a floating-point number. Since my registers are 64-bit, how do I compile for it using gcc?
You would pass -msoft-float on the compile command line. This forces the compiler not to use any floating-point registers, and instead treat the target as having only the 32 32-bit GPRs.
I am implementing a filter and I need to optimise the implementation as much as possible. I have realised that there is an instruction that needs a lot of cycles, and I do not understand why:
bool filters_apply(...)
{
short sSample;
double dSample;
...
...
sSample = (short) dSample; //needs a lot of cycles to execute
...
...
}
I am using the GCC options: -mcpu=arm926ej-s -mfloat-abi=softfp -mfpu=vfp
I have tried to compile with the "hard" float ABI to see if there is a difference, but the compiler does not implement it.
Could anyone explain why that instruction needs so many cycles?
Thanks a lot!!
Just by looking at the information you've provided, it could be because of the stalls that happen when you transfer data from a floating-point register to an ARM register.
This Debian page on ARM floating-point modes claims it can take around 20 cycles for such an operation.
Try to use floating-point variables as much as possible, for example by converting sSample to a float. Your arm926ej-s (VFPv2) should provide 32 single-precision (16 double-precision) registers.
The v4 series of the gcc compiler can automatically vectorize loops using the SIMD processor on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?
The original page offers details on getting gcc to automatically vectorize
loops, including a few examples:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
While the examples are great, it turns out the syntax for calling those options with latest GCC seems to have changed a bit, see now:
https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info
In summary, the following options will work for x86 chips with SSE2,
giving a log of loops that have been vectorized:
gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
Note that -msse is also a possibility, but it will only vectorize loops
using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)
Modern versions of GCC enable -ftree-vectorize at -O3, so just use that in GCC 4.x and later:
gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5
(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)
Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here. Comments? -- Author
To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.
Reduction loops (like sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. Example on the Godbolt compiler explorer with -O3 -march=native -ffast-math including a reduction (array sum) which is scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)
Sometimes you don't need -ffast-math, just -fno-math-errno can help gcc inline math functions and vectorize something involving sqrt and/or rint / nearbyint.
Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc) and / or profile-guided optimization with -fprofile-generate / test run(s) with realistic input(s) /-fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.
There is a GIMPLE (an intermediate representation of GCC) pass, pass_vectorize. This pass enables auto-vectorization at the GIMPLE level.
For enabling auto-vectorization (GCC v4.4.0), we need to follow these steps:
Mention the number of words in a vector as per target architecture. This can be done by defining the macro UNITS_PER_SIMD_WORD.
The vector modes that are possible need to be defined in a separate file, usually <target>-modes.def. This file has to reside in the directory where the other machine-description files reside (as per the configuration script; if you can change the script, you can place the file in whatever directory you want).
Define which modes are to be considered for vectorization for the target architecture: for example, four words constitute a vector, eight half-words constitute a vector, or two double-words constitute a vector. These details also go in the <target>-modes.def file. For example:
VECTOR_MODES (INT, 8);    /* V8QI V4HI V2SI */
VECTOR_MODES (INT, 16);   /* V16QI V8HI V4SI V2DI */
VECTOR_MODES (FLOAT, 8);  /* V4HF V2SF */
Build the port. Vectorization can be enabled using the command line options -O2 -ftree-vectorize.