_mm512_mask_i32logather_pd not available for GNU compiler - gcc

I have a codebase which contains AVX-512 intrinsics and was built using the Intel compiler. I am trying to build the same thing using the GNU compiler. While compiling the code with the -mavx512f flag using gcc, I get a declaration error for only some AVX-512 intrinsics, like _mm512_mask_i32logather_pd.
Standalone Implementation
#include <iostream>
#include <cstdlib>     // for malloc
#include <immintrin.h>

int main() {
    __m512d set = _mm512_undefined_pd();
    __mmask16 msk = 42440;
    __m512i v_index = _mm512_set_epi32(64,66,70,96,98,100,102,104,106,112,114,116,118,120,124,256);
    int scale = 8;
    int count_size = 495*4;
    float *src_ptr = (float*)malloc(count_size*sizeof(float));
    __m512 out_512 = (__m512)_mm512_mask_i32logather_pd(set, msk, v_index, (float*)src_ptr, _MM_SCALE_8);
    return 0;
}
Compiling this standalone reproducer with gcc, I get the error:
error: ‘_mm512_mask_i32logather_pd’ was not declared in this scope; did you mean ‘_mm512_mask_i32gather_pd’?
Compiling the same code using icc with the -xCORE-AVX512 flag works perfectly fine.
Is this because the GNU compiler doesn't support all the AVX-512 intrinsics, even though most of them work perfectly fine with the -mavx512f flag?
Relevant information
gcc version - 11.2.0
ubuntu version - 22.04
icc version - 2021.6.0

GCC has intrinsics for all AVX-512 instructions. It just doesn't always have every alternate version of an intrinsic when the versions differ only in their C semantics, not in the underlying instruction they expose.
I think the only difference from the regular _mm512_mask_i32gather_pd intrinsic (which GCC supports) is that logather takes a __m512i vindex instead of a __m256i, but uses only the low half, hence the lo in the name. (I looked at them in the intrinsics guide: same pseudocode, just a different C/C++ function signature, and they're listed as intrinsics for the same single instruction.) There doesn't seem to be a higather intrinsic that includes a shuffle; you have to do the extraction yourself.
vgatherdpd gathers 8 double elements to fill a __m512d, using 32-bit indices. The corresponding 8 indices total only 32 bytes. That's why the regular, more widely supported intrinsic takes only a __m256i vindex arg.
Your code strangely bothers to initialize 64 bytes (16 indices) without shuffling the high half down. You're also merge-masking into _mm512_undefined_pd(), which is a weird choice for an example. But this pretty obviously isn't intended to be useful, since you're also loading from uninitialized malloc. You're casting the result to a __m512, I guess using this instruction to gather pairs of float instead of individual doubles? If so, yes, it's more efficient to gather fewer, wider elements, but it's a weird way to make a minimal simple example for an intrinsic you're looking for. I wonder if perhaps you were looking for _mm512_mask_i32gather_ps, to gather 16 float elements merged into a __m512 vector. (The non-_mask_ version gathers all 16 elements, and you don't have to supply a merge target; that's often what you want.)
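If that is the goal, here is a minimal sketch of the 16x float variant, reusing the question's indices and mask (the setzero merge target and the function wrapper are my additions; src_ptr is assumed to point at enough readable floats to cover the largest selected index):
#include <immintrin.h>

// The _ps gather takes a full __m512i of 16 dword indices and a __mmask16.
// Compile with -mavx512f.
__m512 gather16(const float *src_ptr) {
    __m512i vindex = _mm512_set_epi32(64,66,70,96,98,100,102,104,
                                      106,112,114,116,118,120,124,256);
    __m512 src = _mm512_setzero_ps();            // defined merge target
    return _mm512_mask_i32gather_ps(src, (__mmask16)42440, vindex,
                                    src_ptr, _MM_SCALE_4);
}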
If you do have your 8 indices in a wider vector for some reason (e.g. as a result of computation and you're going to do 2 gathers after shuffling), you can just cast the vector type:
__m512i vindex = ...;   // the part we want is only the low half
__m512d result = ...;   // something to merge into
result = _mm512_mask_i32gather_pd(result, mask, _mm512_castsi512_si256(vindex),
                                  src_ptr, _MM_SCALE_8);
Your cast to (float*) in the arg list to the intrinsic makes no sense: it actually takes a void* so you can gather 64-bit chunks from anything (and yes, it's strict-aliasing and alignment safe, not following C rules). But the natural type here would be double*, since this is a _pd gather.
In your example, it would be simpler to just use __m256i vindex = _mm256_setr_epi32(...); (or _mm256_set_epi32, if you like the highest-element-first order for the argument list).
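Putting the pieces together, a minimal sketch of a version that compiles with both gcc -mavx512f and icc. The indices are the low 8 of the question's 16 (element 0 first) and the mask is the low byte of its 42440; the zero-initialized buffer and setzero merge target are my additions so the gather reads defined memory:
#include <immintrin.h>
#include <cstdlib>

int main() {
    // Low 8 indices from the question, in setr (element-0-first) order.
    __m256i vindex = _mm256_setr_epi32(256, 124, 120, 118, 116, 114, 112, 106);
    __mmask8 msk = 0xC8;                    // low byte of the question's 42440
    // calloc instead of malloc so the gather reads defined data;
    // 512 doubles covers the largest index (256) at scale 8.
    double *src_ptr = (double*)calloc(512, sizeof(double));
    __m512d result = _mm512_setzero_pd();   // defined merge target
    result = _mm512_mask_i32gather_pd(result, msk, vindex, src_ptr, _MM_SCALE_8);
    free(src_ptr);
    return 0;
}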

Related

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: we can take the function (1+x^2)^3 when x>3, and 0 when x<=3.
Such a function would be executed 100k times on vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v = v.unaryExpr([](double x) { return x > 3 ? std::pow(1 + x*x, 3.00) : 0.00; });
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain is much smaller than I expected (1.5x). I also replaced the condition with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics), and I am not sure how these affect the computation. I wrote my code with AVX2, so I was expecting a 4x gain; I presume the dependencies play a role, but I cannot be sure, as the CPU has out-of-order processing. Another problem is that I am unsure whether the performance of this loop is bound by memory bandwidth.
Question
How can I determine whether memory bandwidth or pipeline hazards are limiting this loop? Where can I learn techniques to vectorize it better? Are there good tools for this with Eigen, MSVC, or Linux? I am using an AMD CPU, as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        tmp = tmp > 3 ? sqrp1*sqrp1*sqrp1 : 0;
    }
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside the ternary would let GCC auto-vectorize, because then none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G. GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially failing to raise an overflow (to infinity) exception that the C++ abstract machine would have raised, or an invalid exception if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (Related: How to force GCC to assume that a floating-point expression is non-negative?)
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements, which matters if you need to avoid cross-statement contraction for Kahan summation, for example. Clang's default is apparently -ffp-contract=off, giving separate mulpd and addpd.)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and might instead call the full pow function.
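For reference, a hand-vectorized sketch of the same loop using the compare-and-AND trick the question describes: compute the cube unconditionally for 4 doubles at a time, then zero the lanes where x <= 3 (AVX2; the function name and scalar-tail handling are my additions):
#include <immintrin.h>
#include <stddef.h>

void foo_avx2(double *arr, size_t n) {
    const __m256d three = _mm256_set1_pd(3.0);
    const __m256d one   = _mm256_set1_pd(1.0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d x     = _mm256_loadu_pd(arr + i);
        __m256d sqrp1 = _mm256_add_pd(one, _mm256_mul_pd(x, x)); // 1 + x*x
        __m256d cube  = _mm256_mul_pd(_mm256_mul_pd(sqrp1, sqrp1), sqrp1);
        __m256d keep  = _mm256_cmp_pd(x, three, _CMP_GT_OQ);     // all-ones where x > 3
        _mm256_storeu_pd(arr + i, _mm256_and_pd(cube, keep));    // zero the rest
    }
    for (; i < n; i++) {  // scalar cleanup for the tail
        double s = 1.0 + arr[i]*arr[i];
        arr[i] = arr[i] > 3 ? s*s*s : 0;
    }
}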

Manipulating Masks for doubles on Xeon Phi

I am doing conditional computations on a Xeon Phi using intrinsic functions.
I have to use double values, so I need a __mmask8.
As long as I only use the compare functions there is no problem, but if I want to modify those masks I run into type conflicts.
Where the documentation gives me plenty of functions to modify a __mmask16 used for single precision, there is not a single one usable for double precision.
I want to do something like the following:
int tmp = 0;
for (i = 0; i < 8; i++) {
    tmp = index[i];
    tmp = tmp << 1;
}
__mmask8 something = _mm512_int2mask(tmp);
The documentation provides this function only for a __mmask16, and the same goes for all the manipulation functions in the Vector Mask Intrinsics chapter of the documentation.
Can I use those functions as well?
Is there a convention like "use every second bit of a __mmask16"?
Thanks in advance
According to http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-vector-microarchitecture
Each VPU has 128 entry 512-bit vector registers divided up among the threads, thus getting 32 entries per thread. These are hard-partitioned. There are eight 16-bit mask registers per thread which are part of the vector register file. The mask registers act as a filter per element for the 16 elements and thus allows one to control which of the 16 32-bit elements are active during a computation. For double precision the mask bits are the bottom 8 bits.
Intel doesn't provide any intrinsics for operating on __mmask8 types; all of the intrinsics are for __mmask16. Therefore I assume we're expected to just use the __mmask16 intrinsics for manipulating __mmask8 values. This seems to work, but I've had very little experience with these so far.
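A minimal sketch of that approach, using the __mmask16 intrinsics named in the documentation and truncating to 8 bits (untested on real Xeon Phi hardware; the helper name is mine):
#include <immintrin.h>

// Build an 8-bit double-precision mask via the 16-bit mask intrinsics.
// The AND with 0x00FF is belt-and-braces: per the quote above, _pd
// operations only look at the bottom 8 mask bits anyway.
__mmask8 make_pd_mask(int bits) {
    __mmask16 m = _mm512_int2mask(bits);
    m = _mm512_kand(m, _mm512_int2mask(0x00FF));
    return (__mmask8)m;
}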

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, thus it needs to be in a quadword register.
ARM NEON Intrinsics has a definition called vector array data type in ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a list of two arbitrarily numbered doubleword registers, because the vzip instruction doesn't place any requirement on the order of its input registers.
Now the answer is...
uint8x8x2_t can contain two non-consecutive doubleword registers, while uint16x8_t is a data structure consisting of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type of two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can just force the conversion; however, I would check the resulting assembly for correctness as well as performance.
You can construct a 128-bit vector from two 64-bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this:
#include <arm_neon.h>

uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t tmp = vzip_u8(a, b);
    uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
    return result;
}
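If you specifically want the uint16x8_t from the question rather than a uint8x16_t, the combined vector can then be reinterpreted at no runtime cost; a small sketch (function name is mine):
#include <arm_neon.h>

// Zip, combine into one 128-bit vector, then reinterpret the 16 bytes
// as eight u16 lanes. vreinterpretq compiles to nothing.
uint16x8_t zip_to_u16(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t tmp = vzip_u8(a, b);
    return vreinterpretq_u16_u8(vcombine_u8(tmp.val[0], tmp.val[1]));
}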
I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it decays to a pointer. Casting and dereferencing the pointer works! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the case I have tried). I haven't looked at the assembly code, so I cannot guarantee it is implemented properly (I mean keeping the value in a register instead of writing/reading to/from memory).
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Take the fix from this bug, apply it to the GCC source, and rebuild.

Difference in integer size on a 64-bit system (confusion compared with my old 32-bit PC)

A few months ago I got myself a laptop with an Intel i7-2630QM CPU and 64-bit Windows. While practising my programming skills on this system, I encountered a difference in integer size which makes me think it's probably due to the new 64-bit system.
Let's take a look at the code.
The C Code :
#include <stdio.h>

int main(void)
{
    int num = 20;
    printf("%d %lld\n", num, num);
    return 0;
}
The Question :
1.) I remember that on my old 32-bit system, when I ran this code, the program would print the integer 20 with some random number next to it, due to the %lld specifier.
2.) But this no longer happens on my new laptop: it prints both integers correctly, even if I change the variable num to type short.
3.) Is there a new integer promotion on a 64-bit system that promotes int to long long when it's used as an argument? Or can a short be promoted to long long (which is also 64-bit) when passed as an argument?
4.) Besides that, I'm quite confused about one thing: on a 16-bit system int is 16-bit, and it's 32-bit on a 32-bit system. So why isn't it 64-bit on a 64-bit system?
==================================================================================
Addon:
1.) I chose "console program (64-bit)" as my project type in the IDE on my new laptop, but "console program" on my old 32-bit PC.
2.) I've checked the size of int under the "console program (64-bit)" project using the sizeof operator: it is still 32-bit, and short remains 16-bit. The only change is the long type, which is 64-bit; long long remains its usual 64-bit size.
You are seeing this side effect because the calling convention is different for x64 code. Function arguments in 32-bit x86 code are passed on the stack, so printf() reads a word from the stack that isn't part of the activation frame. The odds that it contains a value of 0 are extremely low.
In x64 code, the first 4 arguments to a function are passed in CPU registers, not on the stack. The odds that the high word of the 64-bit register happens to be zero are quite good, left there by a previous 64-bit operation that worked with small numbers. But it's certainly not guaranteed.
Trying to reason out the defined behavior of undefined behavior is otherwise not useful, other than as a guess at how the language is implemented for the core in your machine. There are better resources for that: learning the machine code your compiler generates is an excellent shortcut, together with a decent debugger that shows you how your C code was translated into machine code. Machine code has no undefined behavior.
I do not have access to a Windows 64-bit compiler right now, but my guess is the following.
Your question is not about integer promotion, but about how parameters are passed from the caller to the called function. This is beyond the C specification, but it is interesting to know.
In 32-bit, all parameters are divided into 32-bit blocks as all registers can hold 32 bits. So in this case we have the following stack layout:
[ 32-bit format string pointer ][ num as 32-bit ][ num as 32-bit ] junk...
In 64-bit, all parameters are divided into 64-bit blocks as all registers can hold 64 bits. So the stack will contain the following:
[ 64-bit format string pointer ][ num as 64-bit ][ num as 64-bit ] junk...
The upper 32 bits of the 64-bit registers holding 32-bit values are conveniently set to zero.
So when printf is reading a 64-bit number, it will load the equivalent of two 32-bit registers on a 32-bit platform but only one 64-bit register, with high bits cleared, on a 64-bit platform.
(1 and 2) As already stated, the behaviour in this situation is undefined, so the compiler is allowed to behave differently for any reason or indeed no reason at all.
(3) The compiler is allowed to define int as 64-bit, in which case no promotion would be necessary because all the variables in question would be the same size. But it almost certainly doesn't.
(4) On most or all 64-bit compilers, int is 32 bits. This is because int has been 32 bits for so long that programmers have come to expect it, and changing it would break existing code. As far as I know this isn't officially part of the standard, but it's one of those de-facto standards that are even harder to change. :-)
Everything you are describing is specific to whatever spec your compiler is using and the platform you are on (with the exception that long is guaranteed to be at least the same size as int):
Wikipedia entries:
long long
int
The C99 standard seeks to end this ambiguity by adding specific types: int32_t, uint64_t, etc. There's also a POSIX spec that defines u_int32_t, etc.
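For example, the fixed-width types come with matching format macros in <inttypes.h>, so the argument size and the format specifier always agree regardless of platform; a quick sketch:
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    int32_t num = 20;
    // PRId32/PRId64 expand to the correct printf specifiers for this
    // platform, so there is no %d-vs-%lld mismatch on 32-bit or 64-bit.
    printf("%" PRId32 " %" PRId64 "\n", num, (int64_t)num);
    return 0;
}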
Edit: I missed the question about printf(), sorry. As #nos points out in the comments on your question, passing something other than a long long to %lld results in undefined behavior. This means there is no rhyme or reason as to what it will do; unicorns spontaneously appearing would not be out of the question.
Oh - and on every compiler and OS I know, int is 32 bits. Changing that has the potential to break things that depend on it being 32 bits.

How to force gcc to use all SSE (or AVX) registers?

I'm trying to write some computationally intensive code for the Windows x64 target, with SSE or the new AVX instructions, compiling with GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx (-m64 is implied).
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3 = 12 xmm or ymm registers for storage, plus 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data to and from the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        // a2 -= v;
        q2 *= _mm_set1_ps(2.);
    }
}
Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we ignore a2, the produced code is indeed straightforward with no moves; everything is performed in registers xmm0-xmm10. When I remove the comment from a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or so.
I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in my code yet. What am I doing wrong? I understand that they are callee-saved registers, but the save/restore overhead is completely justified by simplifying the loop code.
Two points:
First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches and store forwarding, among other tricks), and the registers available only in 64-bit mode (xmm8-xmm15) are more costly to access (their encodings need REX prefixes, making instructions larger), so it may just be that GCC's version is as fast as, or faster than, the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there were, it would always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution in general. The best it can do is approximate.)
So, if you want better register allocation, you basically have two options:
write a better register allocator, and patch it into GCC, or
bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
Actually, what you see aren't spills: it's GCC operating on a1 and a2 in memory because it can't know whether they are aliased. If you declare the last two parameters as vect<__m128>& __restrict__, GCC can and will keep a1 and a2 in registers.
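A sketch of that fix, with a minimal vect3 standing in for the question's vect<__m128> (the struct of 3 __m128 with the operators the loop uses; the stand-in is mine, only the __restrict__ qualifiers are the point):
#include <immintrin.h>

// Minimal stand-in for the question's vect<__m128>.
struct vect3 {
    __m128 x, y, z;
    vect3& operator+=(const vect3& o) {
        x = _mm_add_ps(x, o.x); y = _mm_add_ps(y, o.y); z = _mm_add_ps(z, o.z);
        return *this;
    }
    vect3& operator-=(const vect3& o) {
        x = _mm_sub_ps(x, o.x); y = _mm_sub_ps(y, o.y); z = _mm_sub_ps(z, o.z);
        return *this;
    }
    vect3& operator*=(__m128 s) {
        x = _mm_mul_ps(x, s); y = _mm_mul_ps(y, s); z = _mm_mul_ps(z, s);
        return *this;
    }
};
inline vect3 operator-(vect3 a, const vect3& b) { a -= b; return a; }

// __restrict__ promises GCC that a1 and a2 don't alias each other (or
// anything else written in the loop), so it can keep them in registers
// instead of storing/reloading through memory every iteration.
void example(vect3 q1, vect3 q2,
             vect3& __restrict__ a1, vect3& __restrict__ a2) {
    for (int i = 0; i < 10; i++) {
        vect3 v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= _mm_set1_ps(2.f);
    }
}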
