How to get GCC to use more than two SIMD registers when using intrinsics? - gcc

I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such nature that I need to load some data into an XMM register and act on it many times. When I'm looking at the assembler code generated, it seems that GCC keeps flushing the data back to the memory, in order to reload something else in XMM0 and XMM1. I am compiling for x86-64 so I have 15 registers. Why is GCC using only two and what can I do to ask it to use more? Is there any way that I can "pin" some value in a register? I added the "register" keyword to my variable definition, but the generated assembly code is identical.

Yes, you can. Explicit Reg Vars talks about the syntax you need to pin a variable to a specific register.

If you're getting to the point where you're specifying individual registers for each intrinsic, you might as well just write the assembly directory, especially given gcc's nasty habit of pessimizing intrinsics unnecessarily in many cases.

It sounds like you compiled with optimization disabled, so no variables are kept in registers between C statements, not even int.
Compile with gcc -O3 -march=native to let the compiler make non-terrible asm, optimized for your machine. The default is -O0 with a "generic" target ISA and tuning.
See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why "debug" builds in general are like that, and the fact that register int foo; or register __m128 bar; can stay in a register even in a debug build. But it's much better to actually have the compiler optimize, as well as using registers, if you want your code to run fast overall!

Related

The Effect of Architecture When Using SSE / AVX Intrinisics

I wonder how does a Compiler treats Intrinsics.
If one uses SSE2 Intrinsics (Using #include <emmintrin.h>) and compile with -mavx flag. What will the compiler generate? Will it generate AVX or SSE code?
If one uses AVX2 Intrinsics (Using #include <immintrin.h>) and compile with -msse2 flag. What will the compiler generate? Will it generate SSE Only or AVX code?
How does compilers treat Intrinsics?
If one uses Intrinsics, does it help the compiler understand the dependency in the loop for better vectorization?
For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.
The Context
I'm trying to build various version of the same functions with different CPU features (SSE4 and AVX2).
I'm writing the same version one with SSE Intrinsics and once with AVX Intrinsics.
Let's say theyare name MyFunSSE() and MyFunAVX(). Both are in the same file.
How can I make the Compiler (Same method should work for MSVC, GCC and ICC) build each of them using only the respective functions?
GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed to call always_inline error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you don't forget useful ones like -mpopcnt that will let std::bitset::count() inline a popcnt instruction, and make all variable-count shifts more efficient with BMI2 shlx / shrx (1 uop vs. 3)
MSVC and ICC do not, and will let you use intrinsics to emit instructions that they couldn't auto-vectorize with.
You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.
For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
__attribute__((target())) will prevent them from inlining into functions with other target options, so be careful to use this on functions they will inline into, if the function itself is too small. Use it on a function containing a loop, not a helper function called in a loop.
See also
https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.
See specify simd level of a function that compiler can use for more about doing runtime dispatch on GCC/clang.
With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.
Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma")))
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL // empty
// maybe warn if __AVX__ isn't defined for functions where this is used?
// if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif
TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
for (size_t i = 0 ; i<1024 ; i++) {
__m256 v = _mm256_loadu_ps(src);
...
...
}
}
With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.
ICC pragma:
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.
#pragma intel optimization_parameter target_arch=AVX
For ICC, you could #define TARGET_AVX as this, and always used it on a line by itself before the function, where you can put an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)
If you compile code with -mavx2 enabled your compiler will (usually) generate so-called "VEX encoded" instructions. In case of _mm_loadu_ps, this will generate vmovups instead of movups, which is almost equivalent, except that the latter will only modify the lower 128 bit of the target register, whereas the former will zero-out everything above the lower 128 bits. However, it will only run on machines which support at least AVX. Details on [v]movups are here.
For other instructions like [v]addps, AVX has the additional advantage of allowing three operands (i.e., the target can be different from both sources), which in some cases can avoid copying registers. E.g.,
_mm_mul_ps(_mm_add_ps(a,b), _mm_sub_ps(a,b));
requires a register copy (movaps) when compiled for SSE, but not when compiled for AVX:
https://godbolt.org/z/YHN5OA
Regarding using AVX-intrinsics but compiling without AVX, compilers either fail (like gcc/clang) or silently generate the corresponding instructions which would then fail on machines without AVX support (see #PeterCordes answer for details on that).
Addendum: If you want to implement different functions depending on the architecture (at compile-time) you can check that using #ifdef __AVX__ or #if defined(__AVX__): https://godbolt.org/z/ZVAo-7
Implementing them in the same compilation unit is difficult, I think. The easiest solutions are to built different shared-libraries or even different binaries and have a small binary which detects the available CPU features and loads the corresponding library/binary. I assume there are related questions on that topic.

How to de-optimize the Linux kernel to avoid value optimized out

I am debugging Linux kernel.
I compile the kernel with -O1 optimization level. (Note that the Linux kernel cannot be compiled with -O0).
When using gdb to debug it, I found some values are optimized out. As shown in the following figure. The len, flags, and add_len arguments are all optimized out.
How can I de-optimize the Linux kernel to avoid making these variables being optimized out?
Building with -Og is supposed to eliminate these problems.
I don't know whether Linux kernel can be compiled that way.
Note that often you can discover the "optimized out" value by going up or down the stack, e.g. if the caller looks like this:
udp_recvmsg(sk, foo->msg, foo->msglen, ...);
then looking at *foo in the caller will tell you the len despite it being optimized out in the udp_recvmsg itself.

change instruction set in GCC

I want to test some architecture changes on an already existing architecture (x86) using simulators. However to properly test them and run benchmarks, I might have to make some changes to the instruction set, Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
asm __volatile__ (".byte 0x90\n");
return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your inst bytes. Of course you wont see the actual instruction on a regular objdump, and the program would likely not run on your system (unless you use one of the nop combinations), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well for you when it doesn't know this instruction, and you should take care and work with inline assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
Complicated solution
The alternative approach is to implement this in your compiler - it can be done in gcc, but as stated in the comments LLVM is probably one of the best ones to play with, as it's designed as a compiler development platform, but it's still very complicated as LLVM is best suited for IR optimization stages, and is somewhat less friendly when trying to modify the target-specific backends.
Still, it's doable, and you have to do that if you also plan to have your compiler decide when to issue this instruction. I'd suggest to start from the first option though, to see if your simulator even works with this addition, and only then spending time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function, there's relatively more documentation about this in here - http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing by modifying group of files in GCC called "machine description". Instruction patterns in <target>.md file, some code in <target>.c file, predicates, constraints and so on. All of these lays in $GCCHOME/gcc/config/<target>/ folder. All of this stuff using on step of generation ASM code from RTL. You can also change cases of emiting instructions by change some other general GCC source files, change SSA tree generation, RTL generation, but all of this a little bit complicated.
A simple explanation what`s happened:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during compilation that generate further code and data. It's 20+ years since I did it so I have forgotten all the details, sorry.

Restrict SSE instruction set

I want my compiler to use only instructions of the specified version of SSE.
For now, looks like -msse2 -mno-sse3 -mno-sse4 -mno-sse41 -mno-sse42 does it, however I'm looking for something like -monly-sse2.
Unless you specify -msse3/-march=<cpu-with-sse3> only SSE2 will be used on x86-64 (and even lower instruction sets on x86).

Optimizing used registers when using inline ARM assembly in GCC

I want to write some inline ARM assembly in my C code. For this code, I need to use a register or two more than just the ones declared as inputs and outputs to the function. I know how to use the clobber list to tell GCC that I will be using some extra registers to do my computation.
However, I am sure that GCC enjoys the freedom to shuffle around which registers are used for what when optimizing. That is, I get the feeling it is a bad idea to use a fixed register for my computations.
What is the best way to use some extra register that is neither input nor output of my inline assembly, without using a fixed register?
P.S. I was thinking that using a dummy output variable might do the trick, but I'm not sure what kind of weird other effects that will have...
Ok, I've found a source that backs up the idea of using dummy outputs instead of hard registers:
4.8 Temporary registers:
People also sometimes erroneously use clobbers for temporary registers. The right way is
to make up a dummy output, and use “=r” or “=&r” depending on the permitted overlap
with the inputs. GCC allocates a register for the dummy value. The difference is that
GCC can pick a convenient register, so it has more flexibility.
from page 20 of this pdf.
For anyone who is interested in more info on inline assembly with GCC this website turned out to be very instructive.

Resources