The simdlen value when a processor core has multiple vector pipelines - openmp

I am reading the OpenMP 4.5 standard and trying to make my mind about the !$omp simd / #pragma omp simd directive. Specifically, that is not clear for me what are the allowed simdlen values.
If I have a processor core with one floating point unit (FPU) capabable of 256-bit vector operations, I would use simdlen(4) for 64-bit floating point variables.
But what simdlen value should I use if a core has two independent vector pipelines with 128-bit registers?

tl;dr:
The standard makes no connection between specific hardware architectures and the simdlen clause of the simd construct, so it's implementation defined.
I would first add the question: Do you need to use simdlen at all?
From my experience with different implementations with AVX2 and AVX-512, I'd say: no, it is no necessary in order to utilise both VPUs per core on Xeon and Xeon Phi, but it can be somewhat beneficial for the performance of the generated code to use twice the native register size as argument. I think the intended use is a different one (see background).
From the standard:
According to the standard (p. 74, l. 22), the simdlen clause for the simd construct (as opposed to the declare simd construct) specifies the preferred behaviour, while the actual behaviour, and thus the answer to the original question, is implementation defined:
If used, the simdlen clause specifies the preferred number of iterations to be executed concurrently. The parameter of the simdlen clause must be a constant positive integer. The number of iterations that are executed concurrently at any given time is implementation defined.
The only constraints for the allowed value stated in the standard are:
The parameter of the safelen clause must be a constant positive integer expression.
If both simdlen and safelen clauses are specified, the value of the simdlen parameter must be less than or equal to the value of the safelen parameter.
Background:
The simdlen clause was added to the simd construct (see Section 2.8.1 on page 72) to support specification of the exact number of iterations desired per SIMD chunk.
This can be used to call a matching SIMD-function generated with the declare simd construct and a corresponding simdlen clause, where the latter has slightly different semantics:
If a SIMD version is created, the number of concurrent arguments for the function is determined by the simdlen clause.
Hope that helps.

Related

how can i further understand the compilation process used by gcc?

I was trying to reverse engineer some psp programs developed using the free
pspsdk
https://sourceforge.net/projects/minpspw/
I noticed that i created a function to see how MIPS handles more than 4 arguments (a0-a4).
Everyone i know has told me that they get passed onto the stack.
To my surprise, that 5th argument was actually passed to register t0 and to compiler didn't even use the stack!
it also inlined a function without even having used a jal or jump to it. (obvious optimization).
Altough there was indeed a space a memory and you could double check by using print with function pointer argument. That actual code that was executed was automatically inlined without the need of a function call instruction.
^^ but that doesn't really benefit me for a reverse engineer attempt...
there is a man page for this version of gcc. and it takes seconds to install if anyone is able to provide it's man for compilation if there is one.
It's so long i don't even know how to reference information reliably
How arguments are passed is specified by the ABI (application binary interface). So you have to find respective documents.
Moreover, there is more than one such ABI, namely n32 and n64. In the case of mips-gcc, some of the decisions are commented in the GCC sources like in ./gcc/config/mips/mips.h
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.

What are the AVX-512 Galois-field-related instructions for?

One of the AVX-512 instruction set extensions is AVX-512 + GFNI, " Galois Field New Instructions".
Galois theory is about field extensions. What does that have to do with processing vectorized integer or floating-point values? The instructions supposedly perform "Galois field affine transformation", the inverse of that, and "Galois field multiply bytes".
What fields are those? What do these instructions actually do and what is it good for?
These instructions are closely related to the AES (Rijndael) block cipher. GF2P8AFFINEINVQB performs a Rijndael S-Box substitution with a user-defined affine transformation.
GF2P8AFFINEQB is essentially a (carry-less) multiplication of an 8x8 bit matrix with an 8-bit vector in GF(2), so it should be useful in other bit-oriented algorithms. It can also be used to convert between isomorphic representations of GF(28).
GF2P8MULB multiplies two (vectors of) elements of GF(28), actually 8-bit numbers in polynomial representation with the Rijndael reduction polynomial. This operation is used in Rijndael's MixColumns step.
Note that multiplication in finite fields is only loosely related to integer multiplication.
One of the major use-cases is I think SW RAID6 parity, including generating new parity on every write. (Not just during recovery / rebuild). RAID5 can use simple XOR parity for its one and only parity member of each stripe, but RAID6 needs two different parities that can recover N blocks of data from any N of the N+2 blocks of data+parity. This is forward error correction, a similar kind of problem that ECC solves.
Galois Fields are useful for this; they're the basis of widely-used Reed-Solomon codes, for example. e.g. Par2 uses 16-bit Galois Fields to allow very large block counts to generate relatively fine-grained error-recovery data for a large file or set of files. (Up to 64k blocks).
Unfortunately GFNI is not great for PAR2 because GFNI only supports GF2P8 GF(28), not the GF(216) that par2 uses. http://lab.jerasure.org/jerasure/gf-complete/issues/14 says it's possible to use GF2P8AFFINEQB to implement wider word sizes so it might be possible to speed up PAR2 with it.
But it should be useful for RAID6, including generating new parity on writes which is pretty CPU intensive. The Linux kernel's md driver already includes inline asm to use SSE2 or AVX2, one of the few uses of kernel_fpu_begin() and kernel_fpu_end(). (A 2013 paper looks at optimizing GF coding using Intel SIMD, mentioning Linux's md RAID and GF-Complete, the project linked earlier. The current state of the art is something like two pshufb byte shuffles to implement a 4-bit table lookup; GFNI could bring that down to 1 instruction especially if the hard-coded GF polynomial baked into gf2p8mulb is used.)
(RAID6 uses parity in a different way than par2, generating separate parity for each stripe "vertically" across disks, instead of "horizontally" for one big array of data. The underlying math is similar.)
Intel pretty probably plans to support GFNI on some future Silvermont-family Atom because there are legacy-SSE encodings of the instructions, without 3-operand VEX or EVEX. Many other new instructions are introduced with only VEX encodings, including some of the BMI1/BMI2 scalar integer instructions.
Silvermont-family (Airmont, Goldmont, Tremont, ...) gets some use in NAS appliances where most of the CPU demand could come from RAID6. A future version of it with GFNI could save power, or avoid bottlenecks without raising clock speed.
AVX + GFNI implies support for a YMM version (even without AVX2), and AVX512F + GFNI implies a ZMM version. (The HTML extract at felixcloutier.com strangely only mentions the non-VEX 128-bit encoding while also listing a _mm_maskz_gf2p8affine_epi64_epi8 intrinsic (masking requires EVEX). HJLebbink's HTML extract does include the VEX and EVEX forms. Maybe they only appear in Intel's "future extensions" manual which HJ scrapes but Felix doesn't.)
Using 512-bit vectors does limit turbo clock speeds for a short time after (on Skylake-Xeon), so it might not be desirable for the kernel to do that. But it could give a significant reduction in CPU overhead for some cases, if you're not memory-bound.
A "field" is a mathematical concept:
(wikipedia) In mathematics, a field is a set on which addition, subtraction, multiplication, and division are defined and behave as the corresponding operations on rational and real numbers do.
...
including the existence of an additive inverse −a for all elements a, and of a multiplicative inverse b−1 for every nonzero element b
Galois Fields are a kind of Finite Field which have this property: the bits in a GF8 number represent 0 or 1 coefficients of a polynomial of degree 8. (It's quite possible I totally butchered that, but it's something like that rather than place-value.) That's why carryless addition (aka XOR) and carryless multiplication (using shift/XOR instead of shift/add) is useful over Galois fields)
gf2p8mulb's baked-in polynomial of x^8 + x^4 + x^3 + x + 1 matches the one used in AES (Rijndael); this lends more weight to #nwellnhof's hypothesis that Intel just included it because the HW was there.
If it's also used in any other common application, it might give us a clue of the "intended" use case for these instructions.
There is a VAES extension that provides versions of AESENC and related instructions for YMM and ZMM vectors, up from just 128-bit vectors with AES-NI + AVX2. So Intel apparently is extending AES HW to 512-bit SIMD vectors. IDK if this motivates wide GFNI or vice versa, or some of both. (Wide GFNI makes a huge amount of sense; if it was limited to 128-bit, an optimized AVX512 implementation using vpshufb for lookup tables would beat it.)
To answer the purpose part, my guess is that these were added primarily for accelerating SM4 encryption, which shares similarity with AES in design.
This guess comes from the fact that ARM also added SM4 acceleration in ARMv8.4 at around the same time, suggesting that chipmakers want to accelerate this algorithm, probably because it'll gain significant traction in the Chinese market. Also, the fact that it's the only AVX512 extension added in Icelake which also has an SSE encoding, so that Tremont could support it, suggests that they intended it for networking/storage purposes.
GFNI is also quite useful in Reed Solomon coding for error correction (as mentioned by Peter above). It's directly applicable to any GF(28) implementation (such as this) and the affine instruction can be used for other field sizes and polynomials - in fact, it's the fastest technique I know of to do so on an Intel processor.
The affine instruction also has a bunch of out-of-band use cases, including 8-bit shifts and bit-permutes. It's equivalent to RISC-V's bmatxor instruction, where some use cases are listed here.
Some links describing use cases for this instruction.

OpenACC Scheduling

Say that I have a construct like this:
for(int i=0;i<5000;i++){
const int upper_bound = f(i);
#pragma acc parallel loop
for(int j=0;j<upper_bound;j++){
//Do work...
}
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.
The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.
According to OpenAcc 2.6 specification[1] Line 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
Which seems to be the case, so your code is valid.
However, note it is implementation defined how to distribute the work among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define values of gang/workers using num_gangs and num_workers, and the integer expression passed to those clauses can depend on the value of your function (See 2.5.7 and 2.5.8 on OpenACC specification).
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf

Why use 1 instead of -1?

At 29min mark of http://channel9.msdn.com/Events/GoingNative/2013/Writing-Quick-Code-in-Cpp-Quickly Andrei Alexandrescu says when using constants to prefer 0 and mentions hardware knows how to handle it. I did some assembly and I know what he is talking about and about the zero flag on CPUs
Then he says prefer the constant 1 rather then -1. -1 IIRC is not actually special but because it is negative the sign flag on CPUs would be set. 1 from my current understanding is simply a positive number there is no bit on the processor flag for it and no way to distinguish from 0 or other positive numbers.
But Andrei says to prefer 1 over -1. Why? What does hardware do with 1 that is better then -1?
First, it should be noted that Andrea Alexandrescu emphasized the difference between zero and the other two good constants, that the difference between using one and negative one is less significant. He also bundles compiler issues with hardware issues, i.e., the hardware might be able to perform the operation efficiently but the compiler will not generate the appropriate machine code given a reasonably clear expression in the chosen higher level language.
While I cannot read his mind, there are at least two aspects that may make one better than negative one.
Many ISAs provide comparison operations (or flag to GPR transfers) that return zero or one (e.g., MIPS has Set on Less Than) not zero or negative one. (SIMD instructions are an exception; SIMD comparisons typically generate zero or negative one [all bits set].)
At least one implementation of SPARC made loading smaller signed values more expensive, and I seem to recall that at least one ISA did not provide an instruction for loading signed bytes. Naive implementation of sign extension adds latency because whether to set or clear the more significant bits is not known until the value has been loaded.
Negative one does have some benefits. As you mentioned, testing for negativity is often relatively easy, so if negative one is the only negative value used it may be handled less expensively. Also, conditionally clearing a value based on zero or negative one is simply an and operation. (For conditionally setting or clearing a single bit, one rather than negative one would be preferred since such would involve only a shift and an and.)

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage thus it needs to be in an quadword register.
ARM NEON Intrinsics has a definition called vector array data type in ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
from these we can conclude that uint8x8x2_t is actually a list of two random numbered doubleword registers, because vzip instructions doesn't have any requirement on order of input registers.
Now the answer is...
uint8x8x2_t can contain non-consecutive two dualword registers while uint16x8_t is a data structure consisting of two consecutive dualword registers which first one has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast vector array data type with two double word registers to a quadword register... easily.
Compiler may be smart enough to assist you, or you can just force conversion however I would check the resulting assembly for correctness as well as performance.
You can construct a 128 bit vector from two 64 bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this.
#include <arm_neon.h>
uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
uint8x8x2_t tmp = vzip_u8(a,b);
uint8x16_t result;
result = vcombine_u8(tmp.val[0], tmp.val[1]);
return result;
}
I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it is therefore seen as a pointer. Casting and deferencing the pointer works ! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as should (at least in the case I have tried). I haven't looked at the assembly code so I cannot grant it is implemented properly (I mean just keeping the value in a register instead of writing/read to/from memory.)
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
Its a bug in GCC(now fixed) on 4.5 and 4.6 series.
Bugzilla link http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from this bug and apply to gcc source and rebuild it.

Resources