Is tr1 array supposed to be 16 byte aligned? - gcc

In "gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)" in tr1 array, I see this:
value_type _M_instance[_Nm ? _Nm : 1] __attribute__((__aligned__));
whereas in "gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)", I see this:
value_type _M_instance[_Nm ? _Nm : 1];
that is, it seems that tr1 arrays are no longer specified as aligned (which affects SSE code written for them). Some of our unit tests are failing in _mm_load_ps. Is there discussion of this change anywhere?

The specification does not require tr1::array to be 16-byte aligned. The only guarantee is that the array is aligned suitably for value_type, that is, to value_type's own alignment requirement. Unless that requirement happens to be a multiple of 16 bytes, you will not get the 128-bit alignment that aligned SSE loads such as _mm_load_ps need. If you have existing code that relies on one compiler having used 16-byte alignment for all array instances, you should fix it. You are relying on behavior beyond what the standard defines, which is very fragile.
If you have code that relies upon a specific amount of alignment on the memory that it uses, then you should explicitly enforce that alignment when you allocate the memory; anything less is prone to errors if you change compilers or platforms. A previous question addresses how to make tr1::array objects use aligned memory.
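As a minimal sketch of enforcing the alignment explicitly, the following assumes a C++11 compiler (so `alignas` and `std::array` rather than `tr1::array`). The names `Float4`, `is_aligned_16` and `checked_data` are hypothetical and not taken from any library.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: alignas(16) requests 16-byte alignment regardless of
// what the library's array implementation happens to do.
struct alignas(16) Float4 {
    std::array<float, 4> values;
};

// True if p sits on a 16-byte boundary, which aligned SSE loads require.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Checks the alignment before handing the pointer to aligned-load code.
inline const float* checked_data(const Float4& v) {
    assert(is_aligned_16(v.values.data()));
    return v.values.data();
}
```

Objects with automatic or static storage get the requested alignment this way; dynamically allocated ones may additionally need an aligned allocator, depending on the language version and platform.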

Related

_mm512_mask_i32logather_pd not available for GNU compiler

I have a codebase which contains AVX512 intrinsics and was built using the Intel compiler. I am trying to build the same code using the GNU compiler. When compiling with gcc and the -mavx512f flag, I get a declaration error only for some AVX512 intrinsics such as _mm512_mask_i32logather_pd.
Standalone Implementation
#include <iostream>
#include <immintrin.h>
int main() {
__m512d set = _mm512_undefined_pd();
__mmask16 msk = 42440;
__m512i v_index = _mm512_set_epi32(64,66,70,96,98,100,102,104,106,112,114,116,118,120,124,256);
int scale = 8;
int count_size = 495*4;
float *src_ptr = (float*)malloc(count_size*sizeof(float));
__m512 out_512 = (__m512)_mm512_mask_i32logather_pd(set, msk, v_index, (float*)src_ptr, _MM_SCALE_8);
return 0;
}
After running this standalone implementation for the function through gcc I am getting the error as
error: ‘_mm512_mask_i32logather_pd’ was not declared in this scope; did you mean ‘_mm512_mask_i32gather_pd’?
Running the same code using icc with -xCORE-AVX512 flag runs perfectly fine.
Is this because the GNU compiler doesn't support all the AVX512 instructions even though most of the instructions works perfectly fine by using -mavx512f flag?
Relevant information
gcc version - 11.2.0
ubuntu version - 22.04
icc version 2021.6.0
GCC has intrinsics for all AVX-512 instructions. It doesn't always have every alternate version of every intrinsic that differ only in their C semantics, not the underlying instruction they expose.
I think the only difference from the regular _mm512_mask_i32gather_pd intrinsic (which GCC supports) is that logather takes a __m512i vindex instead of a __m256i, but uses only the low half, hence the lo in the name. (I looked at them in the intrinsics guide: same pseudocode, just a difference in C/C++ function signature, and they're listed as intrinsics for the same single instruction.) There doesn't seem to be a higather intrinsic that includes a shuffle; you need to do the extracting yourself.
vgatherdpd gathers 8 double elements to fill a __m512d, using 32-bit indices. The corresponding 8 indices are only a total of 32 bytes wide. That's why the regular more widely-supported intrinsic only takes a __m256i vindex arg.
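To make the "8 doubles, 8 32-bit indices" shape concrete, here is a portable scalar model of a merge-masked gather. It is only a sketch of the semantics described above, not the real intrinsic; `masked_gather_pd_model` and its parameter names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model (not the real intrinsic) of a masked gather of 8 doubles
// using 8 32-bit indices. `scale` is the byte multiplier applied to each index.
inline void masked_gather_pd_model(double dst[8], std::uint8_t mask,
                                   const std::int32_t index[8],
                                   const void* base, int scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    for (int i = 0; i < 8; ++i) {
        if (mask & (1u << i)) {
            // memcpy keeps the 8-byte read free of aliasing and alignment issues.
            std::memcpy(&dst[i],
                        bytes + static_cast<std::int64_t>(index[i]) * scale,
                        sizeof(double));
        }
        // Lanes whose mask bit is clear keep their old value (merge-masking).
    }
}
```

Only eight indices and eight mask bits are ever consulted, which is why a 256-bit index vector is enough for the real instruction.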
Your code strangely bothers to initialize 64 bytes (16 indices), not shuffling the high half down. Also you're merge-masking into _mm512_undefined_pd(), which seems a weird example. But pretty obviously this isn't intended to be useful, since you're also loading from uninitialized malloc. You're casting the result to a __m512, I guess using this instruction to gather pairs of float instead of individual doubles? If so, yeah it's more efficient to gather fewer elements, but it's a weird way to make a minimal simple example for an intrinsic you're looking for. I wonder if perhaps you were looking for _mm512_mask_i32gather_ps to gather 16x float elements, merging into a __m512 vector. (The non-_mask_ version gathers all 16 elements, and you don't have to supply a merge target; that's often what you want.)
If you do have your 8 indices in a wider vector for some reason (e.g. as a result of computation and you're going to do 2 gathers after shuffling), you can just cast the vector type:
__m512i vindex = ...; // the part we want is only the low half
__m512d result = something to merge into;
result = _mm512_mask_i32gather_pd(result, mask, _mm512_castsi512_si256(vindex),
src_ptr, _MM_SCALE_8);
Your cast to (float*) in the arg list to the intrinsic makes no sense: it actually takes a void* so you can gather 64-bit chunks from anything (and yes it's strict-aliasing and alignment safe, not following C rules). But the normal type would be double*, since this is a _pd gather.
In your example, it would be simpler to just __m256i vindex = _mm256_setr_epi32(...); (Or set, if you like the highest-element-first order for the argument list.)

Compiler assumptions about relative locations from memory objects

I wonder what assumptions compilers make about the relative locations of memory objects.
For example if we allocate two stack variables of size 1 byte each, right after another and initialize them both with zero, can a compiler optimize this case by only emitting one single instruction that overwrites both bytes in memory with zeros, because the compiler knows the relative position of both variables?
I am interested specifically in the more well known compilers like gcc, g++, clang, the Windows C/C++ compiler etc.
A compiler can optimize multiple assignments into one.
a = 0;
b = 0;
might become something like
*(short*)&a = 0;
The subtle part is "if we allocate two stack variables of size 1 byte each, right after another" since you cannot really do that. A compiler can shuffle stack positions around at will. Also, simply declaring variables will not necessarily mean any stack allocation. Variables might just be in registers. In C you would have to use alloca and even that does not provide "right after another".
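More generally, the C standard does not let you order the memory positions of different objects: using <, >, <= or >= on pointers to unrelated objects is undefined behavior, so portable code cannot observe how two separate variables are laid out relative to each other.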
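If the two bytes really must be adjacent, one option is to make them elements of a single array, whose layout is guaranteed. A minimal sketch, with the hypothetical names `TwoFlags` and `clear`:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical type: array elements are contiguous by definition, so
// flags[1] is guaranteed to sit one byte after flags[0].
struct TwoFlags {
    unsigned char flags[2];
};

// Zeroes both bytes; the compiler is free to emit a single 16-bit store.
inline void clear(TwoFlags& t) {
    std::memset(t.flags, 0, sizeof t.flags);
    assert(&t.flags[1] == &t.flags[0] + 1);
}
```

Whether the two stores are actually merged is still up to the optimizer; the array only guarantees the layout that makes the merge possible.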

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, so it needs to be in a quadword register.
ARM NEON Intrinsics has a definition called vector array data type in ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is really a pair of arbitrarily numbered doubleword registers, because the vzip instruction places no requirement on the order of its input registers.
Now the answer is...
A uint8x8x2_t can hold two non-consecutive doubleword registers, while a uint16x8_t consists of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type made of two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can force the conversion; either way, I would check the resulting assembly for correctness as well as performance.
You can construct a 128 bit vector from two 64 bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this.
#include <arm_neon.h>
uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
uint8x8x2_t tmp = vzip_u8(a,b);
uint8x16_t result;
result = vcombine_u8(tmp.val[0], tmp.val[1]);
return result;
}
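The question asked for a uint16x8_t rather than a uint8x16_t. On NEON, the remaining step would be to pass the vcombine_u8 result through vreinterpretq_u16_u8. The portable scalar model below sketches the lane values that sequence should produce, assuming a little-endian target; `U16x8Model` and `zip_as_u16_model` are hypothetical stand-ins, not NEON types or intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for uint16x8_t: eight 16-bit lanes.
struct U16x8Model {
    std::uint16_t lane[8];
};

// Scalar model of zip + combine + reinterpret on a little-endian target:
// the zipped byte stream is a0,b0,a1,b1,..., so 16-bit lane i holds a[i]
// in its low byte and b[i] in its high byte.
inline U16x8Model zip_as_u16_model(const std::uint8_t a[8],
                                   const std::uint8_t b[8]) {
    assert(a != nullptr && b != nullptr);
    U16x8Model r;
    for (int i = 0; i < 8; ++i) {
        r.lane[i] = static_cast<std::uint16_t>(a[i] | (b[i] << 8));
    }
    return r;
}
```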
I have found a workaround: the val member of the uint8x8x2_t type is an array, so it decays to a pointer. Casting and dereferencing that pointer works! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the case I have tried). I haven't looked at the assembly code, so I cannot guarantee it is implemented efficiently (I mean keeping the value in a register instead of writing it to memory and reading it back).
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from this bug report, apply it to the GCC source, and rebuild.

Size reduction for enum storage in Fujitsu Softune

The Fujitsu microcontroller used is 32-bit.
Hence enum storage is also 32-bit. But in my project the enum elements never exceed 256.
Are there any compiler options to reduce the storage size of enums?
You could use a bit field to be able to store 256 unique values in 8 words (256 bits / 32 bit words = 8), but then the compiler will no longer be able to enforce that only a single bit is set at a time. But, you could easily write a wrapper function to clear out all the previous bits before setting one. It would probably end up kind of messy, but that's what tends to happen when you start using these kinds of tricks at this level to save memory.
You could use preprocessor macros (#define) to map symbolic names to values. Without knowing what your application is, it's hard to predict if this is sensible :)
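A related approach that needs no compiler option is to keep the symbolic names but store the value in an 8-bit integer. The sketch below assumes the toolchain provides an 8-bit unsigned type (here `std::uint8_t`). The names `Mode`, `Status`, `set_mode` and `get_mode` are hypothetical, and whether this suits Softune has not been checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum; the names and values are illustrative only.
enum Mode { MODE_IDLE = 0, MODE_RUN = 1, MODE_FAULT = 255 };

// Stores the enumerator in one byte instead of an int-sized enum object.
struct Status {
    std::uint8_t mode;
};

inline void set_mode(Status& s, Mode m) {
    assert(static_cast<unsigned>(m) <= 255u);  // value must fit in 8 bits
    s.mode = static_cast<std::uint8_t>(m);
}

inline Mode get_mode(const Status& s) {
    return static_cast<Mode>(s.mode);
}
```

The cost is that the compiler no longer checks the stored type, so the accessors are the only place where the conversion should happen.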

C18 compiler typedef enum data size

I'm trying to port code over to compile using Microchip's C18 compiler for a PIC microcontroller. The code includes enums with large values assigned (>8-bit). They are not working properly, indicating that, for example, 0x02 is the same as 0x2002.
How can I force the enumerated values to be referenced as 16-bit values?
In the DirectX headers, every enum has a FORCE_DWORD value in it with a value of 0xffffffff. I guess that's basically what you want: it forces the compiler to give the enum at least 32 bits. So try adding a FORCE_WORD with a value of 0xffff.
This won't solve your problem, of course, if that compiler just does not support enums greater than 8 bits.
I found the problem.
For future reference, by default the C18 compiler will NOT promote variables OR constants to int when performing an arithmetic operation, even though the ANSI C standard requires it. This is to increase speed while running on 8-bit processors.
To force ANSI compliance, use the "-Oi" compiler option.
See page 92 of the C18 manual.
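The effect of missing integer promotion can be modelled on any standard compiler by truncating results to 8 bits by hand. This is only an illustration of the arithmetic, not C18 output; the function names are hypothetical. Truncating 0x2002 to 8 bits gives 0x02, which is consistent with the symptom described in the question.

```cpp
#include <cassert>
#include <cstdint>

// Standard behaviour: both operands are promoted to int before the addition.
inline int sum_with_promotion(std::uint8_t a, std::uint8_t b) {
    return a + b;
}

// Models 8-bit arithmetic without promotion by truncating the result.
inline int sum_without_promotion(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Models a 16-bit constant squeezed into 8 bits.
inline std::uint8_t truncate_to_8_bits(unsigned value) {
    return static_cast<std::uint8_t>(value);
}

inline void check_examples() {
    assert(sum_with_promotion(200, 100) == 300);
    assert(sum_without_promotion(200, 100) == 44);  // 300 mod 256
    assert(truncate_to_8_bits(0x2002) == 0x02);
}
```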
