When __builtin_memcpy is replaced with libc's memcpy - gcc

GCC provides its own version of the C99/POSIX memcpy function: __builtin_memcpy.
Sometimes GCC replaces it with an inline version of memcpy, and in other cases with a call to libc's memcpy. E.g., it was noted here:
Finally, on a compiler note, __builtin_memcpy can fall back to emitting a memcpy function call.
What is the logic behind this selection? Is the logic the same in other GCC-compatible compilers, such as clang/LLVM, the Intel C++ compiler, PCC, and suncc (Oracle Studio)?
When should I prefer __builtin_memcpy over plain memcpy?

I experimented with builtin replacement some time ago and found that the <string.h> functions are only replaced when the size of the source argument is known at compile time. In that case the call to libc is replaced directly by unrolled code.
Unless you compile with -fno-builtin, -ansi, -std=c89 or something similar, it doesn't actually matter whether you use the __builtin_ prefix or not.
Although it's hard to follow, the code that decides whether to emit a library call or a chunk of code seems to be here.
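A minimal sketch for checking this yourself. The struct and function names are made up for illustration. If you compile it with gcc -O2 -S and read the assembly, the fixed-size copy typically turns into a few plain moves while the variable-size copy typically stays a call to memcpy, but the exact outcome depends on the GCC version, the target and the options used.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte payload, used only to get a compile-time-constant size. */
struct packet {
    unsigned char bytes[16];
};

/* The size is a compile-time constant, so GCC can expand the copy inline. */
void copy_fixed(struct packet *dst, const struct packet *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* The size is only known at run time, so GCC usually falls back to calling
 * libc's memcpy, even though the __builtin_ spelling is used here. */
void copy_variable(void *dst, const void *src, size_t n)
{
    __builtin_memcpy(dst, src, n);
}
```

Both spellings behave the same at run time; the difference, if any, is only in the generated code.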

Related

The Effect of Architecture When Using SSE / AVX Intrinsics

I wonder how a compiler treats intrinsics.
If one uses SSE2 intrinsics (using #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate? Will it generate AVX or SSE code?
If one uses AVX2 intrinsics (using #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate? Will it generate SSE-only or AVX code?
How do compilers treat intrinsics?
If one uses Intrinsics, does it help the compiler understand the dependency in the loop for better vectorization?
For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.
The Context
I'm trying to build various versions of the same function for different CPU features (SSE4 and AVX2).
I'm writing the same function twice, once with SSE intrinsics and once with AVX intrinsics.
Let's say they are named MyFunSSE() and MyFunAVX(). Both are in the same file.
How can I make the compiler (the same method should work for MSVC, GCC and ICC) build each of them using only the respective instructions?
GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you don't forget useful ones like -mpopcnt that will let std::bitset::count() inline a popcnt instruction, and make all variable-count shifts more efficient with BMI2 shlx / shrx (1 uop vs. 3)
MSVC and ICC do not, and will let you use intrinsics to emit instructions that they couldn't auto-vectorize with.
You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.
For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
__attribute__((target())) will prevent functions from inlining into callers that have different target options, so if the function itself is too small, put the attribute on the function it should inline into. Use it on a function containing a loop, not a helper function called in a loop.
See also
https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.
See specify simd level of a function that compiler can use for more about doing runtime dispatch on GCC/clang.
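As a rough sketch of manual runtime dispatch on GCC/clang, assuming an x86 target: the function names and the HAVE_X86_DISPATCH macro below are made up for illustration. Only the function carrying the target attribute is compiled with AVX enabled, and it is only called after __builtin_cpu_supports reports that the CPU has AVX. On other compilers or targets the code falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1   /* hypothetical helper macro for this sketch */
#endif

/* Portable fallback: dst[i] = a[i] + b[i]. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#ifdef HAVE_X86_DISPATCH
/* The whole loop lives inside the AVX-enabled function, so the intrinsics
 * can inline into it even when the file is built without -mavx. */
__attribute__((target("avx")))
static void add_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)          /* leftover elements */
        dst[i] = a[i] + b[i];
}
#endif

/* Picks an implementation at run time based on what the CPU reports. */
void add_dispatch(float *dst, const float *a, const float *b, size_t n)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx")) {
        add_avx(dst, a, b, n);
        return;
    }
#endif
    add_scalar(dst, a, b, n);
}
```

The check is done once per call here; for very short arrays you would probably want to resolve the function pointer once and reuse it.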
With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.
Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma")))
#include <immintrin.h>  // _mm256_* intrinsics
#include <stddef.h>     // size_t
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL // empty
// maybe warn if __AVX__ isn't defined for functions where this is used?
// if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif
TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
    for (size_t i = 0 ; i<1024 ; i++) {
        __m256 v = _mm256_loadu_ps(src);
        ...
        ...
    }
}
With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.
ICC pragma:
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.
#pragma intel optimization_parameter target_arch=AVX
For ICC, you could #define TARGET_AVX as this, and always use it on a line by itself before the function, where you can put an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)
If you compile code with -mavx2 enabled, your compiler will (usually) generate so-called "VEX encoded" instructions. In the case of _mm_loadu_ps, this will generate vmovups instead of movups, which is almost equivalent, except that the latter will only modify the lower 128 bits of the target register, whereas the former will zero out everything above the lower 128 bits. However, it will only run on machines which support at least AVX. Details on [v]movups are here.
For other instructions like [v]addps, AVX has the additional advantage of allowing three operands (i.e., the target can be different from both sources), which in some cases can avoid copying registers. E.g.,
_mm_mul_ps(_mm_add_ps(a,b), _mm_sub_ps(a,b));
requires a register copy (movaps) when compiled for SSE, but not when compiled for AVX:
https://godbolt.org/z/YHN5OA
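To reproduce this locally, the expression can be wrapped in a small function (the function name is made up here) and compiled twice, for example with gcc -O2 -msse2 -S and gcc -O2 -mavx -S, to compare the output. This is only a sketch, assuming an x86 target; the exact instructions will vary with compiler version.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps */

/* Computes (a + b) * (a - b) element-wise on four floats.
 * Both a and b are needed twice, so the two-operand SSE encoding has to
 * preserve one of them in another register before overwriting it. */
__m128 sum_times_diff(__m128 a, __m128 b)
{
    return _mm_mul_ps(_mm_add_ps(a, b), _mm_sub_ps(a, b));
}
```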
Regarding using AVX intrinsics but compiling without AVX, compilers either fail (like gcc/clang) or silently generate the corresponding instructions, which would then fail on machines without AVX support (see @PeterCordes's answer for details on that).
Addendum: If you want to implement different functions depending on the architecture (at compile-time) you can check that using #ifdef __AVX__ or #if defined(__AVX__): https://godbolt.org/z/ZVAo-7
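A small sketch of what such a compile-time switch could look like (sum_floats is a made-up name). When the file is built with AVX enabled, for example with -mavx, the compiler defines __AVX__ and the vector path is compiled in; otherwise only the scalar loop remains.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sums n floats. The AVX part exists only in builds where __AVX__ is defined. */
float sum_floats(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);      /* spill the 8 partial sums */
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#endif
    for (; i < n; i++)                 /* scalar path, or the leftover tail */
        total += p[i];
    return total;
}
```

Note that the two builds add the elements in a different order, so results can differ in the last bits for general floating-point input.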
Implementing them in the same compilation unit is difficult, I think. The easiest solutions are to build different shared libraries or even different binaries and have a small binary which detects the available CPU features and loads the corresponding library/binary. I assume there are related questions on that topic.

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) by popular compilers I mean any of: gcc, clang, msvc, icc, nvcc (for GPU kernel code) about which you have that information.
In GCC you can declare functions like this:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
    return val / 2;
}
This is a GCC-only feature.
See a working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC does not verify optimize() arguments, so typos like "-ffast-match" will be silently ignored.
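Here is a sketch of two variants living in one translation unit, in plain C rather than as template instantiations. The FAST_MATH macro and the function names are made up for illustration, and the macro expands to nothing on compilers other than GCC. Keep in mind that the GCC manual describes the optimize attribute as intended for debugging rather than production use, so treat this as an experiment and check the generated code.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#define FAST_MATH __attribute__((optimize("-ffast-math")))
#else
#define FAST_MATH /* empty: the function is compiled with the normal options */
#endif

/* Compiled with whatever floating-point options the command line specifies. */
float sum_strict(const float *data, size_t length)
{
    float total = 0.0f;
    for (size_t i = 0; i < length; i++)
        total += data[i];
    return total;
}

/* Same source, but GCC is asked to apply -ffast-math to this function only,
 * which among other things allows it to reorder the additions. */
FAST_MATH
float sum_fast(const float *data, size_t length)
{
    float total = 0.0f;
    for (size_t i = 0; i < length; i++)
        total += data[i];
    return total;
}
```

Comparing the assembly of the two functions (gcc -O3 -S) is the easiest way to see whether the attribute had any effect.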
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.

How does gcc's linktime optimisation (-flto flag) work

I understand more or less the idea: When compiling separate modules and producing assembly code, functions calling each other have to respect strictly the calling convention, which kills the opportunity for many optimisations when compiling separate modules.
For instance if I have function A which calls function B which calls function C, all 3 in their own separate source files, it becomes possible to allocate registers evenly within the functions so that no register saving on the stack is necessary at all during those calls. With traditional compile-assembly-linking this is not possible, as the caller-saved and callee-saved registers are imposed by the calling convention.
Another optimisation is to inline functions which are called only once. This previously was possible only if a function is local, but thanks to linktime optimisation it's now possible even if the function is in another source file.
Now, if I compile with both -flto and -S flags, I see that instead of normal assembly instructions, gcc generates an encoded representation of the program, such as this:
.section .gnu.lto_.inline.c3c5e6ef8ec983c,"dr0"
.ascii "x\234mQ;N\303#\20}\273\353\17\370C\234\20\242`\"!Q\20\11Ah\322&\25\242\314\231|\4\32\220\220(,$.#\205D\343\3P Z.\341Tn\231\35\274\31L\342\342\355\314\274\371<\317\30\354\376\356\365\357\333\7\262"
.ascii "1\240G\325\273\202\7\216\232\204\36\205"
.ascii "8\242\370\240|\222"
.ascii "8\374\21\205ty\352\"*r\340!:!n\357n%]\224\345\10|\304\23\342\274z\346"
.ascii "8\35\23\370\7\4\1\366s\362\203j\271]\27bb{\316\353\27\343\310\4\371\374\237*n#\220\342rA\31"
.ascii "7\365\263\327\231\26\364\10"
.ascii "2\\-\311\277\255^w\220}|\340\233\306\352\263\362Qo+e+\314\354\277\246\354\252\277\20\364\224%T\233'eR\301{\32\340\372\313\362\263\242\331\314\340\24\6\21s\210\243!\371\347\325\333&m\210\305\203\355\277*\326\236\34\300-\213\327\306\2Td\317\27\231\26tl,\301\26\21cd\27\335#\262L\223"
.ascii "8\353\30\351\264{I\26\316\11\14"
.ascii "9\326h\254\220B}6a\247\13\353\27M\274\231"
.ascii "0\23M\332\272\272%d[\274\36Q\200\37\321\1&\35"
Since the data is in its own particular section, the linker sees this, and does the code generation. If the module was written in either assembly or with no -flto flag, then the linker would see data in the .text section instead, so there is no confusion possible for the linker.
The problem is: How can the linker generate code? Normally only gcc can generate code, the linker's role is just here to change a few offsets and adapt the binary format. In order to generate code, the linker would need to contain a second copy of the entire gcc backend (half of the compiler which generates assembly code from intermediate representation), as well as the entire assembler (since no assembly code was produced). How is such a thing possible, especially considering that binutils is a completely separate entity from gcc, developed by different teams?
GCC's -flto emits a serialized form of GCC's internal representation, as you discovered.
Then, at link time, the linker reinvokes GCC and passes it the objects that need final compilation. GCC reads the internal representation and does the work.
I think the actual work is done in collect2, which is part of GCC that is used when invoking the linker (I'm a little fuzzy on the details). There is also a "linker plugin" system that enables this to work a little better (like letting the linker decide how to split the compilation). This is implemented at least by the binutils ld and by gold; but as far as I recall this is just an optimization and isn't needed to get the basic -flto feature to work. You can see a bit more information on the original LTO project page; and maybe links from there would explain more.
There is more overlap between the GCC and binutils teams than you might think. The two projects share some code and have a long history of working together. Some people work on both projects.
From https://gcc.gnu.org/wiki/LinkTimeOptimization:
Despite the "link time" name, LTO does not need to use any special
linker features. The basic mechanism needed is the detection of GIMPLE
sections inside object files. This is currently implemented in
collect2 [which is called by gcc; -ps]. Therefore, LTO will work on any linker already supported by
GCC.
I assume this means you must link calling the compiler driver gcc. Simply linking with the system's vanilla linker wouldn't optimize the whole program, as you already concluded.
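To make that concrete, here is a sketch of two source files shown in one listing (the file and function names are hypothetical). With a traditional build the call to square() crosses an object-file boundary; with -flto on both the compile and the link step, and linking done through the gcc driver, GCC gets a chance to inline it.

```c
/* ---- square.c (hypothetical file name) ---- */
int square(int x)
{
    return x * x;
}

/* ---- sum.c (hypothetical file name) ---- */
int square(int x);   /* declaration only; the definition lives in square.c */

int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i);   /* cross-file call that LTO may inline */
    return total;
}

/*
 * Build sketch, linking through the gcc driver rather than calling ld directly:
 *   gcc -O2 -flto -c square.c sum.c
 *   gcc -O2 -flto square.o sum.o <your other objects> -o demo
 */
```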
Update:
https://gcc.gnu.org/onlinedocs/gccint/Collect2.html says
The program collect2 is installed as ld in the directory where the
passes of the compiler are installed. When collect2 needs to find the
real ld it tries the following file names: [...]
(The page goes on detailing how collect2 looks for configuration-dependent executables and ones with well-known names like real-ld, finally even ld; but will not call itself recursively.)

Why does GCC use frame pointer when I call Win32 functions with arguments?

When I compile 32-bit C code with GCC and the -fomit-frame-pointer option, the frame pointer (ebp) is not used unless my function calls Windows API functions with stdcall and at least one parameter.
For example, if I only use GetCommandLine() from the Windows API, which has no parameters/arguments, GCC will omit the frame pointer and use ebp for other things, speeding up the code and not having that useless prologue.
But the moment I call a stdcall Win32 function that accepts at least one argument, GCC completely ignores the -fomit-frame-pointer and uses the frame pointer anyway, and the code is worse in inspection as it can't use ebp for general purpose things. Not to mention I find the frame pointer quite pointless. I mean, I want to compile for release and distribution, why should I care about debugging? (if I want to debug I'll just use a debug build instead after reproducing the bug)
My stack most certainly does NOT contain dynamic allocation like alloca. So, the stack has a defined structure yet GCC chooses the dumb method despite my options? Is there something I'm missing to force it to not use frame pointer?
My second gripe is that it refuses to use "push" instructions for Win32 functions. Every other compiler I tried used push instructions to put arguments on the stack, resulting in much more compact code, not to mention that it is the most natural way to pass arguments for stdcall. Yet GCC stubbornly uses "mov" instructions to store each argument manually at offsets relative to esp, because it needs to keep the stack pointer completely static. stdcall is made to be easy on the caller, and yet GCC completely misses the point of stdcall since it generates this crappy code when interfacing with it. What's worse, since the stack pointer is static, it still uses a frame pointer? Just why?
I tried -mpush-args, it doesn't do anything.
I also noticed that if I make my stack big enough to exceed a page (4096 bytes), GCC will add a prologue with a function that does nothing but "bitwise or" the stack every 4096 bytes with zero (which changes nothing). I assume it's for touching the stack and automatically committing memory with page faults if the stack was reserved? Unfortunately, it does this even if I set the initial commit of the stack (not reserve) high enough to hold my stack, not to mention this shouldn't even be needed in the first place. Redundant code at its best.
Are these bugs in GCC? Or something I'm missing in options? Should I use something else? Please tell me if I'm missing some options.
I seriously hope I won't have to make an inline asm macro just to call stdcall functions and use push instructions (and this will avoid frame pointer too I guess). That sounds really overkill for something so basic that should be in compilers of today. And yes I use GCC 4.8.1 so not an ancient version.
As an extra question, is it possible to force GCC to not save registers on the stack in the function prologue? I use my own direct entry point with the -nostartfiles argument, because it is a pure Windows application and it works just fine without the standard library startup. If I use __attribute__((noreturn)), it will discard the epilogue restoring the registers but it will still push them on the stack in the prologue. I don't know if there's a way to force it to not save registers for this entry point function. Either way, not a big deal in the least; it would just feel more complete, I guess. Thanks!
See the answer to "Force GCC to push arguments on the stack before calling function (using PUSH instruction)".
I.e. try -mpush-args -mno-accumulate-outgoing-args. It may also require -mno-stack-arg-probe if gcc complains.
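For experimenting with those flags without the Windows headers, a stand-in like the following can be compiled with something like gcc -m32 -O2 -fomit-frame-pointer -mpush-args -mno-accumulate-outgoing-args -S and the assembly for caller() inspected. This is only a sketch: fake_api is a made-up placeholder for a Win32 function, and the API_CALL macro is hypothetical. The callee is marked noinline so that the call and its argument setup stay visible.

```c
#if defined(__GNUC__) && defined(__i386__)
#define API_CALL __attribute__((stdcall, noinline))
#elif defined(__GNUC__)
#define API_CALL __attribute__((noinline))   /* stdcall only matters on 32-bit x86 */
#else
#define API_CALL
#endif

/* Placeholder for a stdcall API function that takes arguments. */
API_CALL int fake_api(int a, int b, int c);

/* The function whose prologue and argument passing you want to look at. */
int caller(int x)
{
    return fake_api(x, x + 1, x + 2) + 1;
}

API_CALL int fake_api(int a, int b, int c)
{
    return a + b + c;
}
```

Putting fake_api in a separate source file gets closer to a real API call, since the compiler then knows nothing about the callee beyond its declaration.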
It looks like supplying the -mpush-args -mno-accumulate-outgoing-args -mno-stack-arg-probe works, specifically the last one. Now the code is cleaner and more normal like other compilers, and it uses PUSH for arguments, even makes it easier to track in OllyDbg this way.
Unfortunately, this FORCES the stupid frame pointer to be used, even in small functions that absolutely do not need it at all. Seriously is there a way to absolutely force GCC to disable the frame pointer?!

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SSE2/SSE3 SIMD intrinsics. My code is of such a nature that I need to load some data into an XMM register and act on it many times. When I look at the generated assembly code, it seems that GCC keeps flushing the data back to memory in order to reload something else into XMM0 and XMM1. I am compiling for x86-64, so I have 16 XMM registers. Why is GCC using only two, and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the "register" keyword to my variable definition, but the generated assembly code is identical.
Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.
If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directly, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.
It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!
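A quick way to see the difference is to compile something like the following once with -O0 and once with -O3 and compare the assembly. The function name is made up, and the sketch assumes an x86 target. With optimization enabled, the four accumulators would normally each stay in their own XMM register for the whole loop; at -O0 they are stored to and reloaded from the stack around every statement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sums the first (n rounded down to a multiple of 16) floats of p,
 * using four independent vector accumulators. */
float sum_four_accumulators(const float *p, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();

    for (size_t i = 0; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(p + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(p + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(p + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(p + i + 12));
    }

    /* Combine the accumulators, then add up the four lanes. */
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```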
