GCC support for XMM registers badly broken?

Whenever I examine the assembly code produced by GCC for code that uses the __m128i type, I see what looks like a catastrophe: there are tons of redundant instructions that serve no purpose.
And yet, as an assembly programmer I'd rather use asm{}, but GCC prevents me from using XMM registers inside asm{} blocks.
Is there some trick to getting GCC to use XMM registers sensibly, or do I need to wait for a future release?
I've got 4.3.4.

Are you compiling with optimisation enabled, e.g. -O3? If so, then gcc usually generates pretty decent SSE code from intrinsics. Most intrinsics map to exactly one SSE instruction. Can you give an example that you consider particularly inefficient?
Also, I'm not sure what you mean about "GCC prevents me from using XMM registers in asm {}". Again, if you provide a specific example, perhaps there's an easy solution.
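In particular, GCC's extended asm can hand you an XMM register via the "x" constraint, so hand-written SSE inside asm blocks is possible without naming registers explicitly. A minimal sketch, assuming -msse2 (the function itself is just an illustration):

    #include <emmintrin.h>

    /* The "x" constraint asks GCC to allocate an SSE (XMM) register;
       "+x" marks the operand as read-write. Compile with -msse2. */
    __m128i add_epi32_asm(__m128i a, __m128i b)
    {
        __asm__("paddd %1, %0" : "+x"(a) : "x"(b));
        return a;
    }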

Related

GCC ARM SIMD intrinsics compiling to scalar instructions

I have a music synthesis app that runs on an RPi3 (Cortex-A53) in 32-bit mode, under a Yocto-based RTLinux. I'm using GCC 6.3 to compile the code, which uses tons of SIMD intrinsics in C++ to operate on float32x4_t and int32x4_t data. The code is instrumented so that I can see how long it takes to execute certain sizeable chunks of SIMD. It worked well until a couple of days ago, when, after I fiddled with some unrelated things, it suddenly slowed down by a factor of more than two.
I went in and looked at the code that was being generated. In the past, the code looked beautiful, very efficient. Now, it's not even using SIMD in most places. I checked the compiler options. They include -marm -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3. Occasionally you see a q register in the generated code, so it knows they exist, but mostly it operates on s registers. Furthermore, it uses lots of code to move pieces of q8-q15 (a.k.a. d16-d31) into general registers and then back into s0-s31 registers to operate on them, and then moves them back, which is horribly inefficient. Does anyone know any reason why the compiler should suddenly start compiling the float32x4_t and int32x4_t vector intrinsics into individual scalar ops? Or any way to diagnose this by getting the compiler to cough up some information about what's going on inside?
Edit: I found that in some places I was doing direct arithmetic on int32x4_t and float32x4_t types, while in other places I was using the ARM intrinsic functions. In the latter case I was getting SIMD instructions, but in the former it was using scalars. When I rewrote the code using intrinsics everywhere, the SIMD instructions reappeared, and the execution time dropped close to what it was before. But I noticed that if I wrote something like x += y * z; the compiler would use scalars, though it was smart enough to use four VFMA instructions, while if I wrote x = vaddq_f32(x, vmulq_f32(y, z)); it would use VADDQ and VMULQ instructions. This explains why it isn't quite as fast as it was before, when the compiler turned the arithmetic operators directly into SIMD.
So the question is now: Why was the compiler willing to compile direct arithmetic on int32x4_t and float32x4_t values into quad SIMD operations before, but not any more? Is there some obscure option that I didn't realize I had in there, and am now missing?
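For reference, the two styles being compared look roughly like this (a sketch only; as described above, whether the operator form becomes vector code seems to depend on the GCC version and flags, which is exactly the open question):

    #include <arm_neon.h>

    /* Operator form: GCC treats float32x4_t as a generic vector type.
       This may fuse into vector VFMA, or fall back to scalar ops. */
    float32x4_t madd_operators(float32x4_t x, float32x4_t y, float32x4_t z)
    {
        return x + y * z;
    }

    /* Intrinsic form: maps to explicit VMULQ followed by VADDQ. */
    float32x4_t madd_intrinsics(float32x4_t x, float32x4_t y, float32x4_t z)
    {
        return vaddq_f32(x, vmulq_f32(y, z));
    }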

Clang vs gcc floating point performance on ARM

I was trying out the clang compiler and wanted to check its performance vs traditional gcc. I found that its floating-point performance is very bad compared to gcc (almost 30% worse). I compared the assembly files of my code from clang and gcc and found that gcc was using F-prefixed mnemonics (e.g. fcmpezd) while clang was using V-prefixed mnemonics (e.g. vcmpe.f64). Does this affect the number of instruction cycles? I believe both instructions are aliases.
Also, in the assembly files, whenever a function is defined, arguments are pushed on the stack. GCC was using an stmfd instruction, while clang was using a push instruction followed by an add instruction that adds some value (or register contents) to the stack pointer (sp). Do the two instruction sequences, (stmfd) versus (push, add), consume the same number of cycles?
I am using -mfpu=vfpv3 when compiling with both clang and gcc. Also, please suggest a good tool that will tell me how many cycles an instruction consumes.

Testing FPU on ARM processor

I am using a Wandboard-Quad that contains an i.MX6 ARM processor. This processor has an FPU that I would like to utilize. Before I do, I want to test how much improvement I will get. I have a benchmark algorithm and have tried it with no optimization and with -mfpu=vfp, and there appears to be no improvement; I do get an improvement with -O3.
I am using the arm-linux-gnueabi libraries. Any thoughts on what is incorrect, and how can I tell if I am using the FPU?
Thanks,
Adam
Look at the assembler output with the -S flag and see if any FPU instructions are being generated. That's probably the easiest thing.
Beyond that, there is a chance that your algorithm was using floating point so rarely that any gain would be masked by loading and unloading the FPU registers. In that case, the -O3 optimizations in other parts of the code would show gains separate from the FPU usage.
The -mfpu option only has an effect when GCC is performing vectorization. Vectorization itself requires a reasonable optimization level (the minimum is -O2 with -ftree-vectorize on). So try -O3 -ftree-vectorize -mfpu=vfp to utilize the FPU, and measure the difference against plain -O3.
Also see ARM GCC docs for cases where -funsafe-math-optimizations may be required.
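As a concrete check, here's a minimal sketch of the kind of loop the vectorizer handles, with an illustrative compile command in the comment (the file name and exact flag set are assumptions; as noted above, vectorizing float code may additionally need -funsafe-math-optimizations):

    /* Compile with something like:
         arm-linux-gnueabi-gcc -O3 -ftree-vectorize -mfloat-abi=softfp \
             -mfpu=neon -S scale.c
       then look in scale.s for vector instructions such as vmul.f32;
       if they appear, the FPU/NEON unit is being used. */
    void scale(float *dst, const float *src, float k, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * k;  /* simple counted loop the vectorizer likes */
    }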
Without any optimisation, the output from GCC is so inefficient that you might not actually be able to measure the difference between software and hardware floating point.
To see the benefits the FPU adds, you need to test with a consistent optimisation level and then switch between -msoft-float and -mhard-float.
With -msoft-float, the compiler links against different libraries and emits function calls for the floating-point operations rather than native instructions. It is still possible that the underlying library uses hardware floating point, but I wouldn't worry about that too much.
You can select different sets of FP instructions using -mfpu=. For the i.MX6 I think you want -mfpu=neon, as that should enable all the applicable floating-point instructions (not just the NEON ones).

gcc, simd intrinsics and fast-math concepts

Hi all :)
I'm trying to get a handle on a few concepts regarding floating point, SIMD/math intrinsics and the fast-math flag for gcc. More specifically, I'm using MinGW with gcc v4.5.0 on an x86 CPU.
I've searched around for a while now, and that's what I (think I) understand at the moment:
When I compile with no flags, any FP code will be standard x87, with no SIMD instructions, and the math.h functions will be linked from msvcrt.dll.
When I use -mfpmath, -msse and/or -march so that MMX/SSE/AVX code gets enabled, gcc actually uses SIMD instructions only if I also specify some optimization flags, like -On or -ftree-vectorize. In that case the instructions are chosen automagically by gcc, and some math functions (I'm still talking about the standard math funcs in math.h) will be turned into intrinsics or optimized into inline code, while others will still come from msvcrt.dll.
If I don't specify optimization flags, does any of this change?
When I use specific SIMD data types (those available as gcc extensions, like v4si or v8qi), I have the option to call intrinsic funcs directly, or again leave the automagic decision to gcc. GCC can still choose standard x87 code if I don't enable SIMD instructions via the proper flags.
Again, if I don't specify optimization flags, does any of this change?
Please correct me if any of my statements is wrong :p
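As a sanity check on those statements, here's a minimal sketch of how a math.h call interacts with those flags on x86 (behaviour varies by gcc version, so treat this as an assumption to verify rather than a guarantee):

    #include <math.h>

    /* With gcc -O2 -msse2 -mfpmath=sse this typically compiles to an
       inline sqrtsd instruction plus a fallback call to sqrt() for the
       errno-setting edge cases; with -ffast-math (or -fno-math-errno)
       the fallback call disappears; with no SSE flags on 32-bit x86 it
       goes through x87 code or the C runtime instead. */
    double root(double x)
    {
        return sqrt(x);
    }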
Now the questions:
Do I ever have to include x86intrin.h to use intrinsics?
Do I ever have to link against libm?
What does fast-math have to do with anything? I understand it relaxes the IEEE standard, but specifically, how? Are other standard functions used? Is some other lib linked? Or are just a couple of flags set somewhere so that the standard lib behaves differently?
Thanks to anybody who is going to help :D
OK, I'm answering for anyone who is struggling a bit to grasp these concepts, like me.
Optimizations with -Ox work on any kind of code, whether x87 or SSE.
fast-math seems to work only on x87 code. Also, it doesn't seem to change the FPU control word o_O
Builtins are always included. This behavior can be avoided for some builtins with certain flags, like -fno-builtin or the strict ISO modes.
libm.a is used for some stuff that is not included in glibc, but with MinGW it's just a dummy file, so at the moment it's useless to link against it.
Using gcc's special vector types seems useful only when calling the intrinsics directly; otherwise the code gets vectorized anyway.
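To make that last point concrete, a minimal sketch of the two styles on x86 (compile with -msse2; the function names are made up for illustration):

    #include <x86intrin.h>

    /* Generic vector type: gcc chooses the instruction itself. */
    typedef int v4si __attribute__((vector_size(16)));

    v4si add_generic(v4si a, v4si b)
    {
        return a + b;               /* typically becomes PADDD */
    }

    /* Explicit intrinsic: you name the operation yourself. */
    __m128i add_intrinsic(__m128i a, __m128i b)
    {
        return _mm_add_epi32(a, b); /* maps directly to PADDD */
    }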
Any correction is welcomed :)
Useful links:
fpu / sse control
gcc math
and the gcc manual on "Vector Extensions", "X86 Built-in functions" and "Other Builtins"

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, Horizontal Add Packed Double, in SSE3). These require a certain register layout that needs to be either deliberately set up or arise from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly meant for hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, the compiler will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general, though, you'll have to use them yourself, either in raw assembler or through compiler intrinsics. I'd say intrinsics are the better approach, since they allow the compiler to understand the code better, and so to schedule and optimize it; but in practice I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
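To illustrate the kind of "fairly simple loop" these answers mention, a sketch (the restrict qualifiers are there so the compiler can prove the arrays don't overlap, which auto-vectorizers generally need):

    /* With gcc -O3 (or -O2 -ftree-vectorize) and SSE enabled, this loop
       is trivially vectorizable and typically becomes MULPS/ADDPS,
       processing four floats per iteration. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }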
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed; otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for the types where SIMD gives an advantage; basically, this is anything with elements shorter than 64 bits (for 64-bit reals it is actually slower to use SIMD).
The latest versions of the compiler will also automatically parallelise across cores.
