For x86-64 there are lots of instruction sets that speed up code execution. Here is a list from gcc wiki https://gcc.gnu.org/wiki/FunctionMultiVersioning:
MMX
SSE
SSE2
SSE3
SSSE3
SSE4.1
SSE4.2
POPCNT
AVX
AVX2
For the gcc compiler, what x86-64 instruction set does gcc target when you compile without any flags versus -O2?
To keep things simple, let's say the question is about gcc version 12 (the most recent major release). I would also like to know which gcc command-line options I need to use so that I can see what my own version of gcc does.
I assume that gcc chooses something "portable", which probably means something slow, but that is only my assumption. Does the default mean something like SSE4.2, or none of these extensions?
If you don't pass a command-line -march option, then you get whatever was selected when gcc was compiled. The default is -march=x86-64 but it could have been overridden by whoever compiled your gcc (e.g. your binary package distributor). See https://gcc.gnu.org/install/configure.html and note the --with-arch option.
You can compile with -v -Q to see what option is in use. Look for the "options passed" line.
With -march=x86-64 you get "least common denominator" code that will run on every known x86-64 CPU, all the way back to the AMD K8. This includes SSE2, which was part of the original AMD64 spec, but not SSE3 or anything later. popcnt would not be included either.
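One way to see which of these extensions a particular build is allowed to use is to test the feature macros GCC predefines. The following is only a sketch: feature_enabled is a hypothetical helper name chosen for illustration, while __SSE2__, __SSE3__, __POPCNT__ and __AVX2__ are macros GCC defines when the corresponding extension is enabled through -march or an -m flag.

```c
#include <string.h>

/* Hypothetical helper (not part of GCC): returns 1 if the compiler was
 * allowed to use the named extension when building this file, 0 if it
 * was not, and -1 if the name is not one this sketch knows about. */
int feature_enabled(const char *name)
{
    if (strcmp(name, "sse2") == 0) {
#ifdef __SSE2__
        return 1;
#else
        return 0;
#endif
    }
    if (strcmp(name, "sse3") == 0) {
#ifdef __SSE3__
        return 1;
#else
        return 0;
#endif
    }
    if (strcmp(name, "popcnt") == 0) {
#ifdef __POPCNT__
        return 1;
#else
        return 0;
#endif
    }
    if (strcmp(name, "avx2") == 0) {
#ifdef __AVX2__
        return 1;
#else
        return 0;
#endif
    }
    return -1; /* unknown name */
}
```

If the description above holds for your gcc, a build with the default -march=x86-64 should report only "sse2" as enabled, and a build with something like -march=skylake should report all four. A distributor-changed default would show up as a different result.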
The -march option is orthogonal to optimization options like -O2 and the -f... flags (e.g. -funroll-loops). You always get code compatible with whatever is selected by -march, no matter what optimization options are in use. However -m flags (like -mavx) can permit the use of other CPU features beyond what -march implies, in which case your code is only guaranteed to run on CPUs with those features.
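Because code built with extra -m flags is only guaranteed to run on CPUs that have those features, it can be useful to check the host at run time before taking such a code path. Here is a minimal sketch, assuming GCC (or a compatible compiler) on an x86 target, where the __builtin_cpu_init and __builtin_cpu_supports builtins are available. The function name host_has_avx2 is made up for this example.

```c
/* Returns 1 if the CPU running this program reports AVX2, 0 if it does
 * not, and -1 when the check is unavailable (a non-x86 target, or a
 * compiler that does not identify itself as GNU-compatible). */
int host_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Initialise the CPU feature data before querying it. */
    __builtin_cpu_init();
    /* The builtin returns a non-zero value when the feature is present. */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}
```

A program could call host_has_avx2() once at startup and fall back to a baseline implementation when it does not return 1.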
GCC supports a -march switch that allows you to specify the architecture you are targeting - allowing it to tune instruction sequences for that platform as well as using instructions that might be available on the platform which aren't available on the "default" or base version of the architecture.
For example, -march=skylake will tell the compiler to target Skylake CPUs, including using instruction sets available on Skylake such as AVX2.
How can I tell what values for -march the local version of gcc supports? Newer versions helpfully list the valid arguments when an invalid argument is passed, but older versions do not.
With gcc7 and later, gcc will print the values it supports as part of the error message.
$ gcc -E -march=help -xc /dev/null
# 1 "/dev/null"
cc1: error: bad value (‘help’) for ‘-march=’ switch
cc1: note: valid arguments to ‘-march=’ switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 bonnell atom silvermont slm knl x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 btver1 btver2
I checked on Godbolt, and x86 gcc6.x and earlier just say error: bad value (invalid) for -march= switch even with -v.
It also doesn't work with clang5.0 or ICC18.
This is target-specific: ARM gcc6.3 does produce a list of supported -march values, or -mcpu=.
For gcc-7.2.0, it's here:
https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/x86-Options.html#x86-Options
You could go to the gcc online documentation. Find the manual for the version you are interested in, then go to the machine-dependent options section. If you are looking into x86, jump to the "x86 Options" section and search for "-march".
I haven't checked the old gcc versions. Another way you could try is to check out the source code, and open the source code that keeps the literal strings for the supported arch.
svn checkout svn://gcc.gnu.org/svn/gcc/trunk gcc_trunk
cd gcc_trunk
Then, maybe, you could try like this:
find . -type f | grep -E '\.(c|cc|cpp|h|hpp)$' | xargs grep -F '"skylake-avx'
As of today, the literal strings are kept in ./gcc/config/i386/i386.c in case of x86 architectures.
P.S.
As Peter mentioned, it seems to be machine-specific. I suspect that there isn't a standard or required behavior for listing the available -march values. For example, if gcc has just been ported to a brand-new instruction set architecture, say LEG (as opposed to ARM), it does not necessarily have a command-line option to list all supported -march values.
Fortunately, it seems like some newer gcc versions provide a way to do so. If you do need such an option for old gccs, writing a gcc plugin, which might work from gcc 4.5 or so, could be taken into consideration:
gcc plugin
simple gcc plugin how to
Gcc plugins are plugged into an existing gcc by adding some command-line options. Gcc has APIs for plugins. All you would need is to write code that checks information such as the gcc version and the architecture gcc runs on, and that prints out the list of supported -march values.
Use the detailed help page:
gcc -v --help
Look for the option -march=CPU, for example in gcc v4.8.4
-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of:
generic32, generic64, i386, i486, i586, i686,
pentium, pentiumpro, pentiumii, pentiumiii, pentium4,
prescott, nocona, core, core2, corei7, l1om, k1om,
k6, k6_2, athlon, opteron, k8, amdfam10, bdver1,
bdver2, bdver3, btver1, btver2
EXTENSION is combination of:
8087, 287, 387, no87, mmx, nommx, sse, sse2, sse3,
ssse3, sse4.1, sse4.2, sse4, nosse, avx, avx2,
avx512f, avx512cd, avx512er, avx512pf, noavx, vmx,
vmfunc, smx, xsave, xsaveopt, aes, pclmul, fsgsbase,
rdrnd, f16c, bmi2, fma, fma4, xop, lwp, movbe, cx16,
ept, lzcnt, hle, rtm, invpcid, clflush, nop, syscall,
rdtscp, 3dnow, 3dnowa, padlock, svme, sse4a, abm,
bmi, tbm, adx, rdseed, prfchw, smap, mpx, sha,
clflushopt, xsavec, xsaves, prefetchwt1
Since GCC 4 there's a --target-help which prints the supported parameters for options including
-march
-mtune
-mabi
-masm
Other options which themselves are architecture-specific e.g. -msse2, -mavx2
I have been tweaking GCC 6.3.0 to get it to use the libc++ runtime instead of libstdc++ so that g++ can be used without worrying about C++ runtime incompatibilities:
https://github.com/RJVB/macstrop/tree/master/lang/gcc6
The tweak works, I can build and run KDE software using g++ against Qt5 and KF5 frameworks (and everything else) built with various clang versions.
What doesn't work is generating code that uses AVX and presumably most or all newer intrinsic instructions.
This is not a new issue; it has come up here before and is answered, for instance, in: How to use AVX/pclmulqdq on Mac OS X
Evidently one can configure gcc to call the linked script instead of the actual as executable.
But can gcc not be configured to use another assembler altogether, like nasm, and would that solve this issue?
According to the ARM ARM, __ARM_NEON__ is defined when Neon SIMD instructions are available. I'm having trouble getting GCC to provide it.
Neon is available on this BananaPi Pro dev board running Debian 8.2:
$ cat /proc/cpuinfo | grep neon
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
I'm using GCC 4.9:
$ gcc --version
gcc (Debian 4.9.2-10) 4.9.2
Try GCC and -march=native:
$ g++ -march=native -dM -E - </dev/null | grep -i neon
#define __ARM_NEON_FP 4
OK, try what Google uses for Android when building for Neon:
$ g++ -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -dM -E - </dev/null | grep -i neon
#define __ARM_NEON_FP 4
Maybe an ARMv7-a with hard float:
$ g++ -march=armv7-a -mfloat-abi=hard -dM -E - </dev/null | grep -i neon
#define __ARM_NEON_FP 4
My questions are:
why am I not seeing __ARM_NEON__?
how do I detect Neon availability in the preprocessor?
And maybe:
what GCC switches should I use to enable Neon SIMD instructions?
Related, on a LeMaker HiKey, which is AARCH64/ARM64 running Linaro with GCC 4.9.2, here's the output from the preprocessor:
$ cpp -dM </dev/null | grep -i neon
#define __ARM_NEON 1
According to ARM, this board does have Advanced SIMD instructions even though:
$ cat /proc/cpuinfo
Processor : AArch64 Processor rev 3 (aarch64)
...
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
There are a number of questions hidden in here, I'll try to extract them in turn...
According to the ARM ARM, __ARM_NEON__ is defined when Neon SIMD instructions are available. I'm having trouble getting GCC to provide it.
That is compiler documentation for [an old version of] the ARM Compiler rather than the ARM Architecture Reference Manual. A better macro to check for the presence of the Advanced SIMD instructions would be __ARM_NEON, which is defined in the ARM C Language Extensions.
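To show what such a preprocessor check can look like, here is a minimal sketch that tests __ARM_NEON (and also accepts the older __ARM_NEON__ spelling) and falls back to scalar code when neither is defined. The names HAVE_NEON, add4 and neon_enabled are invented for this example. The intrinsics vld1q_f32, vaddq_f32 and vst1q_f32 come from arm_neon.h.

```c
#include <stddef.h>

/* Prefer the ACLE macro __ARM_NEON; accept the legacy spelling too. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Adds four floats element-wise: out[i] = a[i] + b[i] for i in 0..3.
 * Uses one Advanced SIMD add when the compiler enabled it, otherwise
 * a plain loop. */
void add4(const float *a, const float *b, float *out)
{
#if HAVE_NEON
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#else
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}

/* Reports which path was compiled in: 1 for NEON, 0 for scalar. */
int neon_enabled(void)
{
    return HAVE_NEON;
}
```

Both paths give the same result, so code written this way still builds and runs when the compiler flags do not enable Advanced SIMD.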
Try GCC and -march=native:
As you may have found, GCC for the ARM target separates out -march (for the architecture revision for which GCC should generate code), -mfpu (for the floating point/Advanced SIMD unit available) and -mfloat-abi (for how floating point arguments should be passed, and for the presence or absence of a floating point unit). Finally there is -mtune (which asks GCC to try to optimise for a particular processor) and -mcpu (which acts as a combination of -mtune and -march).
By asking for -march=native you're asking GCC to generate code appropriate for the detected architecture of the processor on which you are running. This has no impact on the -mfpu setting, and so does not necessarily enable Advanced SIMD instruction generation.
Note that the above only applies to a compiler targeting AArch32. The AArch64 GCC does not support -mfpu and will detect presence of Advanced SIMD support through -march=native.
OK, try what Google uses for Android when building for Neon:
$ g++ -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -dM -E
These build flags are not sufficient to enable support for Advanced SIMD instructions; your notes may be incomplete. Of the -mfpu flags supported by GCC 4.9.2, I'd expect any of the following to give you what you want:
neon, neon-fp16, neon-vfpv4, neon-fp-armv8, crypto-neon-fp-armv8
According to ARM, this board does have Advanced SIMD instructions even though:
Looks like you're running on an AArch64 kernel, which exposes support for Advanced SIMD through the asimd feature - as in your example output.
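The same information can be queried from a program rather than from /proc/cpuinfo. The sketch below assumes Linux with a C library that provides getauxval, and it assumes that the HWCAP_ASIMD (AArch64) and HWCAP_NEON (AArch32) bits from asm/hwcap.h correspond to the asimd and neon entries in the Features line. The function name kernel_reports_simd is hypothetical.

```c
#if defined(__linux__) && (defined(__aarch64__) || defined(__arm__))
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ASIMD on AArch64, HWCAP_NEON on AArch32 */
#endif

/* Returns 1 if the kernel reports Advanced SIMD for this process, 0 if
 * it does not, and -1 on platforms this sketch does not cover. */
int kernel_reports_simd(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) ? 1 : 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) ? 1 : 0;
#else
    return -1;
#endif
}
```

This is a run-time check of the machine, so it is independent of the -march and -mfpu settings used to compile the program.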
I have code which produces executables larger than 2GB (it's generated code).
On x64 with gcc 4.3.2 I get errors like:
crtstuff.c:(.text+0x20): relocation truncated to fit:
R_X86_64_32S against `.dtors'
So I understand I need the -mcmodel=large option. However, that doesn't do anything, and it doesn't solve the problem on my system.
I am sure I read somewhere, that it was only supported from a particular version of gcc, and the option was ignored on versions before that. I would tell my operations team to install that version of gcc if only I knew what it was. But I just can't find any evidence right now to tell me if that hypothesis is true, and if so in which version the feature was introduced.
For example
(1) Here it is stated that the option doesn't do anything. The book in question claims to cover "GCC 4.x". The book came out 2006.
(2) Here a compiler bug is being reported against the option, therefore I conclude in that version it must do at least something. That seems to be gcc 4.6.1.
So although I can no longer find evidence of exactly in which version the feature was implemented, at least there is evidence that this has changed over time.
I have tried looking through the changelogs for all the various GCC 4.x versions to no avail (and normally they are pretty good so the lack of information there almost implies that I am wrong and nothing has changed between versions.)
Edit: This seems to imply that perhaps it did work, but I need to "recompile crtstuff.c", but I don't really know where I find that file or how I do that.
I believe 4.4 is the version that added support for this feature. I demonstrate below that 4.1 doesn't work while 4.4 does, on something that needs a large data block (rather than code). I'm not sure about 4.2 and 4.3, but both your example and my memory suggest 4.3 didn't have working support for this. My example should let you validate whether a particular installation works or not though, on an otherwise easy to compile bit of code.
As background, I maintain a program that's a fork of the stream benchmark, modified specially to use 64 bit structures for testing larger systems. I was plagued with these "relocation truncated to fit" errors until I started using "-mcmodel=large", and my fork won't compile/run unless that really does work. The oldest version of gcc I've definitely found my program compatible with is the 4.4.5 that ships with Debian Squeeze.
Here's a complete test case showing my fork of stream compiling and using >4GB of RAM with the large model, after failing to do so without the option:
$ gcc --version
gcc (Debian 4.4.5-8) 4.4.5
...
$ git clone https://github.com/gregs1104/stream-scaling.git
$ cd stream-scaling
$ gcc -O3 -DN=200000000 -fopenmp stream.c -o stream
/tmp/cca8rR1I.o: In function `checkSTREAMresults':
stream.c:(.text+0x34): relocation truncated to fit: R_X86_64_32S against `.bss'
...
stream.c:(.text+0x6ab): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
$ gcc -O3 -DN=200000000 -fopenmp stream.c -o stream -mcmodel=large
$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
...
And here's what happens on a version of gcc that doesn't have the large model, one running RedHat 5 derived software (CentOS 5.8):
$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52)
...
$ gcc -O3 -DN=200000000 -fopenmp stream.c -o stream -mcmodel=large
stream.c:1: sorry, unimplemented: code model ‘large’ not supported yet
So on older versions of gcc, it should throw that error out, not just ignore the option.
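For checking an installation without cloning the benchmark, a much smaller file in the same spirit might look like the sketch below. It is not taken from stream.c: the names static_bytes, code_model and fill_and_sum are invented, N defaults to a small value so the file builds under any code model, and the __code_model_*__ macros are an assumption about what your GCC predefines on x86-64 (the function returns "unknown" if they are absent).

```c
#include <stddef.h>

/* Array length; override with -DN=200000000 to mimic the transcript above. */
#ifndef N
#define N 1000
#endif

/* Three statically allocated arrays, as in a STREAM-style benchmark. */
static double a[N], b[N], c[N];

/* Total size of the static arrays in bytes. */
size_t static_bytes(void)
{
    return sizeof a + sizeof b + sizeof c;
}

/* Names the code model the compiler says it is using, if it says so. */
const char *code_model(void)
{
#if defined(__code_model_large__)
    return "large";
#elif defined(__code_model_medium__)
    return "medium";
#elif defined(__code_model_small__)
    return "small";
#else
    return "unknown";
#endif
}

/* Touches every element so the arrays are really used; returns 3.0 * N. */
double fill_and_sum(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < (size_t)N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = a[i] + b[i];
        sum += c[i];
    }
    return sum;
}
```

With -DN=200000000 the three arrays add up to 4,800,000,000 bytes, which matches the 4577.6 MB reported above, so linking a program that calls these functions without -mcmodel=large would be expected to fail in the same way as the first gcc invocation in the transcript.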
crtstuff is a library coming with gcc. The bug report you linked to on the gcc mailing list was from someone trying to build their own gcc for a RedHat 5 system, which as you can see in this last example ships with gcc 4.1. They rebuilt part of gcc with the large model, but it was still linking against the original, 4.1 built crtstuff library. You shouldn't run into that problem if you're using a properly packaged gcc, which is why it wasn't considered a real bug by the gcc developers. I think you just need gcc 4.4 or later.