Why don't gcc/clang vectorize 128-bit SIMD intrinsics into 256-bit when possible? - gcc

Suppose I have this function:
void test32(int* a, int* b, size_t n) {
for (size_t i = 0; i < n; ++i) {
a[i] = a[i] + b[i];
}
}
Clang and gcc both produce 256-bit SIMD when compiled with -O3 -march=core-avx2 (godbolt).
Now suppose I have this function:
void test128(__m128i* a, __m128i* b, size_t n) {
for (size_t i = 0; i < n; ++i) {
a[i] = _mm_add_epi32(a[i], b[i]);
}
}
With the same CFLAGS, clang and gcc both refuse to vectorize this to 256-bit (godbolt).
The naive code (auto-vectorized) therefore processes twice as many elements per iteration compared to the manually vectorized SSE2 code. How does this make sense? Is there a way to instruct the compiler to vectorize 128-bit SIMD intrinsics into 256-bit when AVX2 is available?

Unfortunately no, I don't know of a compiler option to re-vectorize intrinsics (or GNU C native vectors) to a wider type. That's one reason not to manually vectorize in the first place for cases that easily auto-vectorize.
It's sometimes useful to be able to tell the compiler what vectorization strategy you want it to use, and that's what intrinsics are for.
If compilers rewrote them too aggressively, that would be bad in some cases. Like cleanup loops following a loop with wider vectors, you might use 128-bit to leave fewer scalar elements after. Or maybe you have 16-byte alignment but not 32, and you care about Sandybridge specifically (where misaligned 32-byte load/store are quite bad). Or you're on Haswell server where 256-bit AVX can reduce max turbo (at least for FP math instructions), so you only want to use 256-bit vectors in some functions that will be running in some phases of your program.
Basically it's a tradeoff between how close to writing in asm it is, for clever humans to specify what they want, vs. just giving a way to tell the compiler about the program logic in a way it can understand and optimize (like the + operator: that doesn't mean you'll get an asm add instruction).
MSVC and ICC tend to take intrinsics even more literally than GCC/clang, not doing constant propagation through them. GCC/clang's choice of how to treat intrinsics is sensible in a lot of ways.
If you have a trivially vectorizable problem like this (no loop carried dependencies or shuffles), and you want your code to be compilable for future wider vector instructions, use OpenMP #pragma omp simd to tell the compiler you definitely want it vectorized. This kind of problem is what OpenMP is for, if you don't enable full optimization to get compilers to try to auto-vectorize every loop. (At gcc -O3, or clang -O2. Also -O2 for GCC12.) OpenMP can get the compiler to vectorize FP math in ways that it normally wouldn't be allowed to without -ffast-math, e.g. FP reductions like sum of an array.

Related

_mm512_mask_i32logather_pd not available for GNU compiler

I have a codebase which contains AVX512 intrinsic instructions and was build using intel compiler. I am trying to run the same thing using GNU compiler. While compiling the code with -mavx512f flag using gcc, I am getting declaration error only for some AVX512 instructions like _mm512_mask_i32logather_pd.
Standalone Implementation
#include <iostream>
#include <immintrin.h>
int main() {
__m512d set = _mm512_undefined_pd();
__mmask16 msk = 42440;
__m512i v_index = _mm512_set_epi32(64,66,70,96,98,100,102,104,106,112,114,116,118,120,124,256);
int scale = 8;
int count_size = 495*4;
float *src_ptr = (float*)malloc(count_size*sizeof(float));
__m512 out_512 = (__m512)_mm512_mask_i32logather_pd(set, msk, v_index, (float*)src_ptr, _MM_SCALE_8);
return 0;
}
After running this standalone implementation for the function through gcc I am getting the error as
error: ‘_mm512_mask_i32logather_pd’ was not declared in this scope; did you mean ‘_mm512_mask_i32gather_pd’?
Running the same code using icc with -xCORE-AVX512 flag runs perfectly fine.
Is this because the GNU compiler doesn't support all the AVX512 instructions even though most of the instructions works perfectly fine by using -mavx512f flag?
Relevant information
gcc version - 11.2.0
ubuntu version - 22.04
icc version 2021.6.0
GCC has intrinsics for all AVX-512 instructions. It doesn't always have every alternate version of every intrinsic that differ only in their C semantics, not the underlying instruction they expose.
I think the only difference between the regular _mm512_mask_i32gather_pd intrinsic (which GCC supports) is that logather takes a __m512i vindex instead of __m256i. But uses only the low half, hence the lo in the name. (I looked at them in the intrinsics guide - same pseudocode, just a difference in C/C++ function signature. And they're listed as intrinsics for the same single instruction). There doesn't seem to be a higather intrinsic that includes a shuffle; you need to do the extracting yourself.
vgatherdpd gathers 8 double elements to fill a __m512d, using 32-bit indices. The corresponding 8 indices are only a total of 32 bytes wide. That's why the regular more widely-supported intrinsic only takes a __m256i vindex arg.
Your code strangely bothers to initialize 64 bytes (16 indices), not shuffling the high half down. Also you're merge-masking into _mm512_undefined_pd(), which seems a weird example. But pretty obviously this isn't intended to be useful, since you're also loading from uninitialized malloc. You're casting the result to a __m512, I guess using this instruction to gather pairs of float instead of individual doubles? If so, yeah it's more efficient to gather fewer elements, but it's a weird way to make a minimal simple example for an intrinsic you're looking for. I wonder if perhaps you were looking for _mm512_mask_i32gather_ps to gather 16x float elements, merging into a __m512 vector. (The non-_mask_ version gathers all 16 elements, and you don't have to supply a merge target; that's often what you want.)
If you do have your 8 indices in a wider vector for some reason (e.g. as a result of computation and you're going to do 2 gathers after shuffling), you can just cast the vector type:
__m512i vindex = ...; // the part we want is only the low half
__m512d result = something to merge into;
result = _mm512_mask_i32gather_pd(result, mask, _mm512_castsi512_si256(vindex),
src_ptr, _MM_SCALE_8);
Your cast to (float*) in the arg list to the intrinsic makes no sense: it actually takes a void* so you can gather 64-bit chunks from anything (and yes it's strict-aliasing and alignment safe, not following C rules). But the normal type would be double*, since this is a _pd gather.
In your example, it would be simpler to just __m256 vindex = _mm256_setr_epi32(...); (Or set, if you like the highest-element-first order for the argument list.)

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector of whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: For example, we can take the function: (1+x^2)^3, when x>3 and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain is much smaller than I expected (1.5x). I also replaced the conditioning with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics). I am not sure how this affects the computation. I wrote my code with AVX2 so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order-processing. Another problem is that I am unsure if the performance of the loop I am trying to write is bound by the memory bandwidth.
Question
How can I determine if either the memory bandwidth or pipeline hazards are affecting the implementation of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in Eigenr MSVC or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>
void foo(double *arr, size_t n) {
for (size_t i=0 ; i<n ; i++){
double &tmp = arr[i];
double sqrp1 = 1.0 + tmp*tmp;
tmp = tmp>3 ? sqrp1*sqrp1*sqrp1 : 0;
}
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside a ternary should let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised. Or invalid if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (related: How to force GCC to assume that a floating-point expression is non-negative?)
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.

Is there a good reason why GCC would generate jump to jump just over one cheap instruction?

I was benchmarking some counting in a loop code.
g++ was used with -O2 code and I noticed that it has some perf problems when some condition is true in 50% of the cases. I assumed that may mean that code does unnecessary jumps(since clang produces faster code so it is not some fundamental limitation).
What I find in this asm output funny is that code jumps over one simple add.
=> 0x42b46b <benchmark_many_ints()+1659>: movslq (%rdx),%rax
0x42b46e <benchmark_many_ints()+1662>: mov %rax,%rcx
0x42b471 <benchmark_many_ints()+1665>: imul %r9,%rax
0x42b475 <benchmark_many_ints()+1669>: shr $0xe,%rax
0x42b479 <benchmark_many_ints()+1673>: and $0x1ff,%eax
0x42b47e <benchmark_many_ints()+1678>: cmp (%r10,%rax,4),%ecx
0x42b482 <benchmark_many_ints()+1682>: jne 0x42b488 <benchmark_many_ints()+1688>
0x42b484 <benchmark_many_ints()+1684>: add $0x1,%rbx
0x42b488 <benchmark_many_ints()+1688>: add $0x4,%rdx
0x42b48c <benchmark_many_ints()+1692>: cmp %rdx,%r8
0x42b48f <benchmark_many_ints()+1695>: jne 0x42b46b <benchmark_many_ints()+1659>
Note that my question is not how to fix my code, I am just asking if there is a reason why a good compiler at O2 would generate jne instruction to jump over 1 cheap instruction.
I ask because from what I understand one could "simply" get the comparison result and use that to without jumps increment the counter(rbx in my example) by 0 or 1.
edit: source:
https://godbolt.org/z/v0Iiv4
The relevant part of the source (from a Godbolt link in a comment which you should really edit into your question) is:
const auto cnt = std::count_if(lookups.begin(), lookups.end(),[](const auto& val){
return buckets[hash_val(val)%16] == val;});
I didn't check the libstdc++ headers to see if count_if is implemented with an if() { count++; }, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless cmovcc or setcc.)
It looks like gcc overestimated the cost of branchless for this code with generic tuning. -mtune=skylake (implied by -march=skylake) gives us branchless code for this regardless of -O2 vs. -O3, or -fno-tree-vectorize vs. -ftree-vectorize. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a vector<int>&, so we don't have to wade through the timing and cout code-gen in main.)
branchy code: gcc8.2 -O2 or -O3, and O2/3 -march=haswell or broadwell
branchless code: gcc8.2 -O2/3 -march=skylake.
That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper cmov. GCC's internal cost model isn't always in terms of x86 instructions when its optimizing in the middle-end (in GIMPLE, an architecture-neutral representation). It doesn't yet know what x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where cmov is 2 uops? But I tested -march=broadwell and still got branchy code. Hopefully we can rule that out assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop cmov, adc, and sbb (3-input integer ops).
I don't know what else about gcc's Skylake tuning option that makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc is auto-vectorizing (with vpgatherqd xmm) even with -march=haswell, where it doesn't look like a win because gather is expensive, and and requires 32x64 => 64-bit SIMD multiplies using 2x vpmuludq per input vector. Maybe worth it with SKL, but I doubt HSW. Also probably a missed optimization not to pack back down to dword elements to gather twice as many elements with nearly the same throughput for vpgatherdd.
I did rule out the function being less optimized because it was called main (and marked cold). It's generally recommended not to put your microbenchmarks in main: compilers at least used to optimize main differently (e.g. for code-size instead of just speed).
Clang does make it branchless even with just -O2.
When compilers have to decide between branching and branchy, they have heuristics that guess which will be better. If they think it's highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an int, finding exactly the value you're looking for is rare. The == may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the -O3 branchless code-gen was slower.
-O3 at least used to be more aggressive at if-conversion of conditionals into branchless sequences like cmp ; lea 1(%rbx), %rcx; cmove %rcx, %rbx, or in this case more likely xor-zero / cmp/ sete / add. (Actually gcc -march=skylake uses sete / movzx, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile Guided Optimization shines. Compile with -fprofile-generate, run it, then compiler with -fprofile-use, and you'll probably get branchless code.
BTW, -O3 is generally recommended these days. Is optimisation level -O3 dangerous in g++?. It does not enable -funroll-loops by default, so it only bloats code when it auto-vectorizes (especially with very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)

Why doesn't my OpenMP simd directive have any use?

I have tried these codes to test SIMD directive in OpenMP.
#include <iostream>
#include <sys/time.h>
#include <cmath>
#define N 4096
#define M 1000
using namespace std;
int main()
{
timeval start,end;
float a[N],b[N];
for(int i=0;i<N;i++)
b[i]=i;
gettimeofday(&start,NULL);
for(int j=0;j<M;j++)
{
#pragma omp simd
for(int i=0;i<N;i++)
a[i]=pow(b[i],2.1);
}
gettimeofday(&end,NULL);
int time_used=1000000*(end.tv_sec-start.tv_sec)+(end.tv_usec-start.tv_usec);
cout<<"time_used="<<time_used<<endl;
return 1;
}
But either I compiled it by
g++ -fopenmp simd.cpp
or
g++ simd.cpp
their reports for "time_used" are almost the same.It looks like the SIMD directive I used doesn't have any use?
Thanks!
Additional questions:
I replaced
a[i]=pow(b[i],2.1);
by
a[i]=b[i]+2.1;
and when I compile them by
g++ -fopenmp simd.cpp
the output of "time_used" is about 12000.
When I compile them by
g++ simd.cpp
the output of "time_used" is about 12000,almost the same as before.
My computer: Haswell i5,8g RAM,ubuntu kylin 16.04,gcc 5.4.0
The compiler can't auto-vectorize function calls. It can only vectorize specific arithmetic operations that can be done using SIMD instructions.
Therefore, you need a vector math library that implements the pow function using SIMD instructions. Intel provides one. I'm not sure if pow is one of the functions that it offers with vector optimizations, but I imagine it is. You should also beware that Intel's math library may not be optimal on AMD processors.
You claim that you tried changing the pow function call to a simple addition, but didn't see any improvement in the results. I'm not quite sure how that is possible, because if you change the inner loop from:
a[i]=pow(b[i],2.1);
to, say:
a[i] += b[i];
or:
a[i] += (b[i] * 2);
then GCC, with optimizations enabled, notices that you never use the result and elides the entire thing. It was unable to perform this optimization with the pow function call, because it didn't know whether the function had any other side-effects. However, with code that is visible to the optimizer, it can…well, optimize it. In some cases, it might be able to vectorize it. In this case, it was able to remove it entirely.
If you tried code where the optimizer removed this loop entirely, and you still didn't see an improvement on your benchmark scores, then clearly this is not a bottleneck in your code and you needn't worry about trying to vectorize it.

How to force gcc to use all SSE (or AVX) registers?

I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx. (-m64 is implied)
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 available SSE (or AVX) registers available for 64bit targets. However, GCC produces a very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data from and onto the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
for (int i=0; i < 10; i++) {
vect<__m128> v = q2 - q1;
a1 += v;
// a2 -= v;
q2 *= _mm_set1_ps(2.);
}
}
Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves, everything is performed in registers xmm0-xmm10. When I remove the comment a2 -= v, the code is pretty awful with a lot of shuffling between registers and stack. Even though the compiler could just use registers xmm11-xmm13 or something.
I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
Two points:
First, You're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches and register shadowing and other tricks), and the 64-bit only registers are more costly to access (in terms of larger instructions), so it may just be that GCC's version is as fast, or faster, than the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there was, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is a NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is to approximate)
So, if you want better register allocation, you basically have two options:
write a better register allocator, and patch it into GCC, or
bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
Actually, what you see aren't spills, it is gcc operating on a1 and a2 in memory because it can't know if they are aliased. If you declare the last two parameters as vect<__m128>& __restrict__ GCC can and will register allocate a1 and a2.

Resources