For gcc compiler, what x86-64 instruction set does gcc target when you compile without any flags versus -O2?

For x86-64 there are lots of instruction sets that speed up code execution. Here is a list from gcc wiki https://gcc.gnu.org/wiki/FunctionMultiVersioning:
MMX
SSE
SSE2
SSE3
SSSE3
SSE4.1
SSE4.2
POPCNT
AVX
AVX2
For gcc compiler, what x86-64 instruction set does gcc target when you compile without any flags versus -O2?
To keep things simple, let's say the question is about gcc version 12 (the most recent major release). But I would also like to know which gcc command-line options I need so that I can see what my own version of gcc does.
I assume that gcc chooses something "portable", which probably means something slow. But this is just my assumption... Does that mean something like SSE4.2, or none of these?

If you don't pass a command-line -march option, then you get whatever was selected when gcc was compiled. The default is -march=x86-64 but it could have been overridden by whoever compiled your gcc (e.g. your binary package distributor). See https://gcc.gnu.org/install/configure.html and note the --with-arch option.
You can compile with -v -Q to see which options are in use. Look for the "options passed" line.
With -march=x86-64 you get "least common denominator" code that will run on every known x86-64 CPU, all the way back to the AMD K8. This includes SSE2, which was part of the original AMD64 spec, but not SSE3 or anything later. popcnt would not be included either.
The -march option is orthogonal to optimization options like -O2 and the -f... flags (e.g. -funroll-loops). You always get code compatible with whatever is selected by -march, no matter what optimization options are in use. However -m flags (like -mavx) can permit the use of other CPU features beyond what -march implies, in which case your code is only guaranteed to run on CPUs with those features.
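As a complement to -v -Q, gcc also documents -Q --help=target, which prints the target options together with their current values. Below is a minimal Python sketch that pulls the -march value out of that output. The function names and the sample text are made up for illustration, and the exact column layout of the output is an assumption that may differ between gcc versions.

```python
import re
import subprocess


def default_march(help_target_text):
    """Return the value shown on the '-march=' line, or None if absent."""
    for line in help_target_text.splitlines():
        m = re.match(r"\s*-march=\s+(\S+)", line)
        if m:
            return m.group(1)
    return None


def query_gcc(extra_args=()):
    """Run `gcc -Q --help=target` (plus optional flags) and return its stdout.

    Not called at import time; requires gcc on PATH.
    """
    cmd = ["gcc", "-Q", "--help=target", *extra_args]
    return subprocess.run(cmd, capture_output=True, text=True).stdout


# Illustrative sample only, not captured from a real run.
SAMPLE = """The following options are target specific:
  -m128bit-long-double        \t\t[disabled]
  -march=                     \t\tx86-64
  -mtune=                     \t\tgeneric
"""

print(default_march(SAMPLE))  # x86-64
```

On a machine with gcc installed, comparing default_march(query_gcc()) with default_march(query_gcc(["-O2"])) should show that the optimization level does not change the selected architecture.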

Related

How can I determine what architectures gcc supports?

GCC supports a -march switch that allows you to specify the architecture you are targeting - allowing it to tune instruction sequences for that platform as well as using instructions that might be available on the platform which aren't available on the "default" or base version of the architecture.
For example, -march=skylake will tell the compiler to target Skylake CPUs, including using instruction sets available on Skylake such as AVX2.
How can I tell what values for -march the local version of gcc supports? Newer versions helpfully list the valid arguments when an invalid argument is passed, but older versions do not.
With gcc7 and later, gcc will print the values it supports as part of the error message.
$ gcc -E -march=help -xc /dev/null
# 1 "/dev/null"
cc1: error: bad value (‘help’) for ‘-march=’ switch
cc1: note: valid arguments to ‘-march=’ switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 bonnell atom silvermont slm knl x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 btver1 btver2
I checked on Godbolt, and x86 gcc6.x and earlier just say error: bad value (invalid) for -march= switch even with -v.
It also doesn't work with clang5.0 or ICC18.
This is target-specific: ARM gcc6.3 does produce a list of supported -march values, or -mcpu=.
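If you want the list in a script rather than reading it off the error message, the note line can be split mechanically. This is only a sketch: the function name is hypothetical, and it assumes the note has the wording shown above. Some gcc versions may append a "did you mean" hint after a semicolon, which the sketch drops.

```python
import re


def valid_march_values(stderr_text):
    """Extract the -march values listed in gcc's 'valid arguments' note."""
    for line in stderr_text.splitlines():
        # '.' stands in for the quote character, which may be ASCII or typographic.
        m = re.search(r"valid arguments to .-march=. switch are: (.*)", line)
        if m:
            # Drop an optional trailing "; did you mean ...?" hint (assumption).
            return m.group(1).split(";")[0].split()
    return []


sample = (
    "cc1: error: bad value (‘help’) for ‘-march=’ switch\n"
    "cc1: note: valid arguments to ‘-march=’ switch are: nocona core2 nehalem\n"
)
print(valid_march_values(sample))  # ['nocona', 'core2', 'nehalem']
```

The stderr text would come from running the gcc -E -march=help -xc /dev/null command above and capturing its error output.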
For gcc-7.2.0, it's here:
https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/x86-Options.html#x86-Options
You could go to the gcc online documentation. Then find the manual for the version you are interested in. After that, go to the machine-dependent options section. If you are looking into x86, jump to the "x86 Options" section. Now search for "-march".
I haven't checked the old gcc versions. Another way you could try is to check out the source code, and open the source code that keeps the literal strings for the supported arch.
svn checkout svn://gcc.gnu.org/svn/gcc/trunk gcc_trunk
cd gcc_trunk
Then, maybe, you could try like this:
find . -type f | grep -E '\.(c|cc|cpp|h|hpp)$' | xargs grep -E '"skylake-avx'
As of today, the literal strings are kept in ./gcc/config/i386/i386.c in case of x86 architectures.
P.S.
As Peter mentioned, it seems to be machine-specific. I suspect there is no standard or required behavior for listing the available -march values. For example, if gcc has just been ported to a brand-new instruction set architecture (say LEG, as opposed to ARM), it does not necessarily have a command-line option to list all supported -march values.
Fortunately, it seems like some newer gcc versions provide a way to do so. If you do need such an option for old gccs, writing a gcc plugin, which might work from gcc 4.5 or so, could be taken into consideration:
gcc plugin
simple gcc plugin how to
Gcc plugins are plugged into an existing gcc by adding some command-line options. Gcc has APIs for plugins. All you would need is to write code that checks information such as the gcc version and the architecture that gcc runs on, and that prints out the list of supported -march values.
Use the detailed help page:
gcc -v --help
Look for the option -march=CPU, for example in gcc v4.8.4
-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of:
generic32, generic64, i386, i486, i586, i686,
pentium, pentiumpro, pentiumii, pentiumiii, pentium4,
prescott, nocona, core, core2, corei7, l1om, k1om,
k6, k6_2, athlon, opteron, k8, amdfam10, bdver1,
bdver2, bdver3, btver1, btver2
EXTENSION is combination of:
8087, 287, 387, no87, mmx, nommx, sse, sse2, sse3,
ssse3, sse4.1, sse4.2, sse4, nosse, avx, avx2,
avx512f, avx512cd, avx512er, avx512pf, noavx, vmx,
vmfunc, smx, xsave, xsaveopt, aes, pclmul, fsgsbase,
rdrnd, f16c, bmi2, fma, fma4, xop, lwp, movbe, cx16,
ept, lzcnt, hle, rtm, invpcid, clflush, nop, syscall,
rdtscp, 3dnow, 3dnowa, padlock, svme, sse4a, abm,
bmi, tbm, adx, rdseed, prfchw, smap, mpx, sha,
clflushopt, xsavec, xsaves, prefetchwt1
Since GCC 4 there's a --target-help which prints the supported parameters for options including
-march
-mtune
-mabi
-masm
Other options which themselves are architecture-specific e.g. -msse2, -mavx2

AVX and newer intrinsics with GCC on Mac; what assembler would one need?

I have been tweaking GCC 6.3.0 to get it to use the libc++ runtime instead of libstdc++ so that g++ can be used without worrying about C++ runtime incompatibilities:
https://github.com/RJVB/macstrop/tree/master/lang/gcc6
The tweak works, I can build and run KDE software using g++ against Qt5 and KF5 frameworks (and everything else) built with various clang versions.
What doesn't work is generating code that uses AVX and presumably most or all newer intrinsic instructions.
This is not a new issue, and it has come up here before; it is answered here, for instance: How to use AVX/pclmulqdq on Mac OS X
Evidently one can configure gcc to call the linked script instead of the actual as executable.
But can gcc not be configured to use another assembler altogether, like nasm, and would that solve this issue?

What flags or environment variables can I pass to Clang to get maximum debugging on both BSD and Linux?

I'm interested in answers, approaches, and out-of-the-box ideas. At a high level, the man page is pretty sparse: it mainly lists -g, with one level, and suggests that -O0 is also either very helpful or essential.
But I'm wondering what other clang flags can be given to give maximum debugging. Is there an equivalent to gcc's -ggdb3 which includes some of the source or annotations directly in the object output? Or could there be? Is it possible and helpful to recompile the OS and its original libraries to have debugging (and if so, if I'm using Debian, can I have it write the debugging into the main .deb package instead of putting a separate debugged .deb package which stores debugging data in /usr/lib/debug?)? Will a static build of a binary affect the ability to see a good stacktrace? And is there anything that needs to be done to ensure that addr2line works well? Is it needed to compile all libraries (even glibc) with clang to get the maximum debugging benefit? I note that there is a project to recompile Debian with clang, and otherwise am open to a distribution that does so or otherwise places emphasis on debugging.
On Linux there are also options like an LD_PRELOAD set to /lib/libSegFault.so, or a set of LD_LIBRARY_PATH reassignments to /usr/lib/debug instead of the usual /usr/lib location (including redirecting libc itself to the debugged version). Is there a central place or external sources for answers to this question of how to enhance debuggability of a binary? The bigger mystery is clang, since I see in the long gcc man page that there are various options which can increase debugging (or reduce optimisations), but on the other hand the documentation for clang only shows a smaller set. It's possible that clang will accept more options than given, including gcc flags (which may either translate to a no-op or to more debugging - hard to tell without a canonical source of information).
Also from a package build perspective, since an external package may not respect CFLAGS, I've redirected /usr/bin/strip to be a no-op command that always succeeds, but other ideas on ensuring compliance are suggested (I believe that pkgsrc does a good job of wrapping gcc and the linker in shell scripts - useful to insert mandatory flags). Also there may be various ld options that can be passed to increase debugging of the outputted target. Also, it's quite possible that BSD (including FreeBSD 10, based upon clang) may have a different linking architecture which could make it easier to request and find debugged symbols in the generated libraries and executables.
To take debugging more broadly defined, I've set LD_WARN=yes, LD_DEBUG=unused, SEGFAULT_SIGNALS="all", LD_PRELOAD=.../libSegFault.so (as mentioned above), and LD_BIND_NOW=yes. Also I believe I can prefer that gcc search for libraries in /usr/lib/debug - above the standard search paths using strategic -Bs. Also, using --whole-archive for a static build might ensure that more objects are included in the linked output. There's also ulimit -c unlimited, and on Linux a nice way to differentiate core files like:
sysctl -w kernel.core_pattern="core.%t.SIG-%s.PID-%p.ID-%g-%u.%h.%E"
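To see what file name such a pattern yields, here is a rough Python sketch that expands the % specifiers. The meanings in the comment are taken from the core(5) man page as I recall it; the function name and the sample values are invented, and the kernel's handling of edge cases (such as a stray trailing %) may differ from this simplification.

```python
def expand_core_pattern(pattern, values):
    """Expand %X specifiers in a core_pattern string using a dict of values.

    Specifiers used in the pattern above (per core(5), as recalled):
      %t time of dump (epoch seconds), %s signal number, %p PID,
      %g real GID, %u real UID, %h hostname,
      %E executable path with '/' replaced by '!'.
    Unknown specifiers expand to an empty string in this sketch.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            spec = pattern[i + 1]
            out.append("%" if spec == "%" else str(values.get(spec, "")))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical values for a crash of /usr/bin/demo with SIGSEGV (signal 11).
example = {"t": 1700000000, "s": 11, "p": 4242, "g": 1000, "u": 1000,
           "h": "box", "E": "!usr!bin!demo"}
name = expand_core_pattern("core.%t.SIG-%s.PID-%p.ID-%g-%u.%h.%E", example)
print(name)  # core.1700000000.SIG-11.PID-4242.ID-1000-1000.box.!usr!bin!demo
```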
For gcc I've used and seen flags like: -O0 -fno-omit-frame-pointer -fverbose-asm -ggdb3 -mno-omit-leaf-frame-pointer -mtune=generic -fvar-tracking -D_GLIBCXX_DEBUG=1 -frecord-gcc-switches -femit-class-debug-always -fmath-errno -fno-eliminate-unused-debug-symbols -fno-eliminate-unused-debug-types -fno-merge-debug-strings -mieee-fp -mtune=generic -static-libgcc -fexceptions -fvar-tracking -fbounds-check -rdynamic -UNDEBUG -DDEBUG=1 (-ffreestanding -static-libgcc -pass-exit-codes) -fno-stack-check (since I believe I've read that the latter can interfere with debugging)
Other flags are there for other reasons, but the emphasis is on maximum debugging. With all or most of the above, it's unclear to what extent clang would support or use them, or whether there are other options.
Clang does not support the -ggdb3 flag, only -g, as you have noticed. If you try to use it, you'll get the message:
clang: warning: argument unused during compilation: '-ggdb3'
So you can run your entire command line through Clang and it will tell you which of those GCC flags it supports and which it does not. Some will print warnings and others may error out, but Clang will not silently ignore them. Here are the ones that Clang rejected when I tried your long command: -static-libgcc and -pass-exit-codes.
As pointed out in another SO answer, clang -cc1 --help can be used to list supported compilation flags, where we see the following which may be of interest to you:
-disable-llvm-optzns: Don't run LLVM optimization passes
-fno-elide-constructors: Disable C++ copy constructor elision
-mdisable-fp-elim: Disable frame pointer elimination optimization
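For a long flag list, the "argument unused" warnings can be collected automatically instead of read by eye. The following is a small Python sketch under the assumption that the warning has exactly the wording quoted above; the function name is hypothetical, and hard errors for rejected flags would need a separate pattern.

```python
import re


def unused_flags(clang_stderr):
    """Return the flags clang reported as unused, in order of appearance."""
    return re.findall(
        r"argument unused during compilation: '([^']+)'", clang_stderr
    )


sample = (
    "clang: warning: argument unused during compilation: '-ggdb3'\n"
    "clang: warning: argument unused during compilation: '-fvar-tracking'\n"
)
print(unused_flags(sample))  # ['-ggdb3', '-fvar-tracking']
```

The second warning in the sample is invented to show multiple matches; which flags your clang version actually warns about has to be checked by running it.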

What are the proper architecture-specific options (-m) for a Sandy Bridge based Pentium?

I'm trying to figure out how to set the -march option properly, so I can see how much performance difference it makes on my PC with gcc 4.7.2.
Before compiling, I tried to find the best -march option for my PC. My PC has a Pentium G850, whose architecture is Sandy Bridge. So I referred to the gcc 4.7.2 manual and found that -march=corei7-avx seems the best.
However, I remembered that Sandy Bridge based Pentium lacks AVX and AES-NI instruction set support, which is true for Pentium G850. So -march=corei7-avx is not a proper option.
I came up with some potential options:
-march=corei7-avx -mno-avx -mno-aes
-march=corei7 -mtune=corei7-avx
-march=native
The first option looks reasonable given the information I have, but I'm worried that there may be missing features other than AVX and AES-NI. The second option looks safe, but it could miss some minor Sandy Bridge features because of -march=corei7. The third option would take care of all of my concerns, but I've heard that it sometimes misdetects CPU features, so I would like to know how to do this manually.
I've googled and searched StackOverflow and SuperUser, but I can't find any clear solutions...
What options should be set?
What about detecting it via GCC? For me (gcc-5.3.0) on an i5-2450M CPU (Lenovo e520), the following shows:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
/usr/libexec/gcc/x86_64-pc-linux-gnu/5.3.0/cc1 -E -quiet -v - -march=sandybridge
-mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16
-msahf -mno-movbe -maes -mno-sha -mpclmul -mpopcnt -mno-abm -mno-lwp
-mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx
-mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd
-mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr
-mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd
-mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves
-mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma
-mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx --param
l1-cache-size=32 --param l1-cache-line-size=64 --param
l2-cache-size=3072 -mtune=sandybridge -fstack-protector-strong
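That cc1 line is long, so here is a small Python sketch that sorts it into the detected -march value, the enabled features, and the disabled features. The function name is made up, and the sketch assumes every feature toggle has the form -mNAME or -mno-NAME, which holds for the output above but is not guaranteed for every gcc option.

```python
def split_cc1_flags(cc1_line):
    """Split a cc1 command line into (march, enabled, disabled) features."""
    march = None
    enabled, disabled = [], []
    for tok in cc1_line.split():
        if tok.startswith("-march="):
            march = tok[len("-march="):]
        elif tok.startswith("-m") and "=" in tok:
            continue  # valued options such as -mtune=... are not toggles
        elif tok.startswith("-mno-"):
            disabled.append(tok[len("-mno-"):])
        elif tok.startswith("-m"):
            enabled.append(tok[2:])
    return march, enabled, disabled


# Shortened excerpt of the output shown above.
line = ("cc1 -E -quiet -v - -march=sandybridge -mmmx -mno-3dnow -msse "
        "-mavx -mno-avx2 --param l1-cache-size=32 -mtune=sandybridge")
print(split_cc1_flags(line))
# ('sandybridge', ['mmx', 'sse', 'avx'], ['3dnow', 'avx2'])
```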
I would suggest using -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes. It is important to specify -mtune because this option tells gcc which CPU model it should use for scheduling instructions in the generated code.
I have a Sandy Bridge based Intel(R) Celeron(R) CPU G530.
When I used -march=native in Gentoo's CFLAGS and then compiled media-video/ffmpeg-1.2.6 (the current stable version in Gentoo), playing video with mplayer failed with an illegal instruction. Just as you said, -march=native sometimes misdetects features of the CPU.
Then I changed to -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes and recompiled ffmpeg-1.2.6 and mplayer, and everything has worked fine so far.
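To address the worry that features other than AVX and AES-NI might be missing, one could compare what the chosen -march implies against the flags line in /proc/cpuinfo and generate the -mno- options from the difference. This Python sketch does that. The feature list for corei7-avx and the mapping to /proc/cpuinfo names are written from memory of the gcc manual and the Linux flag names, so treat both as assumptions to verify locally. The function name and the sample flags string are hypothetical.

```python
# GCC feature name -> /proc/cpuinfo flag name (assumed mapping; verify locally).
COREI7_AVX_FEATURES = {
    "mmx": "mmx", "sse": "sse", "sse2": "sse2", "sse3": "pni",
    "ssse3": "ssse3", "sse4.1": "sse4_1", "sse4.2": "sse4_2",
    "avx": "avx", "aes": "aes", "pclmul": "pclmulqdq",
}


def flags_for(base_arch, arch_features, cpuinfo_flags):
    """Build -march/-mtune plus a -mno-X for each feature the CPU lacks."""
    present = set(cpuinfo_flags.split())
    missing = [gcc_name for gcc_name, cpu_name in arch_features.items()
               if cpu_name not in present]
    parts = ["-march=" + base_arch, "-mtune=" + base_arch]
    parts += ["-mno-" + name for name in missing]
    return " ".join(parts)


# Hypothetical flags line for a CPU without AVX and AES.
SAMPLE_CPUINFO_FLAGS = "fpu mmx sse sse2 pni ssse3 sse4_1 sse4_2 pclmulqdq popcnt"
print(flags_for("corei7-avx", COREI7_AVX_FEATURES, SAMPLE_CPUINFO_FLAGS))
# -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes
```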

How to use gcc and -msoft-float on an i386/x86-64? [duplicate]

Is it (easily) possible to use software floating point on i386 linux without incurring the expense of trapping into the kernel on each call? I've tried -msoft-float, but it seems the normal (ubuntu) C libraries don't have a FP library included:
$ gcc -m32 -msoft-float -lm -o test test.c
/tmp/cc8RXn8F.o: In function `main':
test.c:(.text+0x39): undefined reference to `__muldf3'
collect2: ld returned 1 exit status
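With more floating-point code there will be many such errors, one per missing helper. A short Python sketch like the one below can list the distinct missing symbols, which shows which soft-float routines a replacement library would have to provide. The function name is hypothetical, the quoting style is assumed to match the ld output above, and the second sample line is invented for illustration.

```python
import re


def missing_symbols(linker_output):
    """Return the sorted, de-duplicated symbols ld reported as undefined."""
    found = re.findall(r"undefined reference to [`']([^`']+)'", linker_output)
    return sorted(set(found))


sample = (
    "test.c:(.text+0x39): undefined reference to `__muldf3'\n"
    "test.c:(.text+0x52): undefined reference to `__adddf3'\n"
    "test.c:(.text+0x6b): undefined reference to `__muldf3'\n"
)
print(missing_symbols(sample))  # ['__adddf3', '__muldf3']
```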
It is surprising that gcc doesn't support this natively as the code is clearly available in the source within a directory called soft-fp. It's possible to compile that library manually:
$ svn co svn://gcc.gnu.org/svn/gcc/trunk/libgcc/ libgcc
$ cd libgcc/soft-fp/
$ gcc -c -O2 -msoft-float -m32 -I../config/arm/ -I.. *.c
$ ar -crv libsoft-fp.a *.o
There are a few C files which don't compile due to errors, but the majority do compile. After copying libsoft-fp.a into the directory with our source files, they now compile fine with -msoft-float:
$ gcc -g -m32 -msoft-float test.c -lsoft-fp -L.
A quick inspection using
$ objdump -D --disassembler-options=intel a.out | less
shows that as expected no x87 floating point instructions are called and the code runs considerably slower as well, by a factor of 8 in my example which uses lots of division.
Note: I would've preferred to compile the soft-float library with
$ gcc -c -O2 -msoft-float -m32 -I../config/i386/ -I.. *.c
but that results in loads of error messages like
adddf3.c: In function '__adddf3':
adddf3.c:46: error: unknown register name 'st(1)' in 'asm'
Seems like the i386 version is not well maintained as st(1) points to one of the x87 registers which are obviously not available when using -msoft-float.
Strangely or luckily the arm version compiles fine on an i386 and seems to work just fine.
Unless you want to bootstrap your entire toolchain by hand, you could start with uclibc toolchain (the i386 version, I imagine) -- soft float is (AFAIK) not directly supported for "native" compilation on debian and derivatives, but it can be used via the "embedded" approach of the uclibc toolchain.
GCC does not support this without some extra libraries. From the 386 documentation:
-msoft-float Generate output containing library calls for floating point. Warning: the requisite libraries are not part of GCC. Normally the facilities of the machine's usual C compiler are used, but this can't be done directly in cross-compilation. You must make your own arrangements to provide suitable library functions for cross-compilation.
On machines where a function returns floating point results in the 80387 register stack, some floating point opcodes may be emitted even if -msoft-float is used.
Also, you cannot set -mfpmath=unit to "none", it has to be sse, 387 or both.
However, according to this gnu wiki page, there is fp-soft and ieee. There is also SoftFloat.
(For ARM there is -mfloat-abi=softfp, but it does not seem like something similar is available for 386 SX).
It does not seem like tcc supports software floating point numbers either.
Good luck finding a library that works for you.
G'day,
Unless you're targeting a platform that doesn't have inbuilt FP support, I can't think of a reason why you'd want to emulate FP support.
Doesn't your 386 platform have external FPU support? Pity it's not a 486 with the FPU built in!
In my experience, any soft emulation is bound to be much slower than its hardware equivalent.
That's why I finished up writing a package in Ada to target the onboard 68k FPU instead of using the soft emulation provided by the compiler manufacturer at the time. They finished up bundling it in their compiler as a matter of fact.
Edit: Just seen your comment below. Hmmm, if you don't need a full suite of FP support, is it possible to roll your own for the few math functions you do need? That's how the Ada package I mentioned got started.
HTH
cheers,

Resources