Why GLM_CONSTEXPR_SIMD? - performance

The OpenGL Mathematics library (GLM) defines a macro GLM_CONSTEXPR_SIMD which causes expressions like vec4(1.0f, 0.0f, 0.0f, 1.0f) to be constexpr only when generating platform-independent code, i.e., only when GLM_ARCH is GLM_ARCH_PURE.
I assume this is done for performance reasons, but why would making something non-constexpr increase performance? And how does SIMD play a role in the decision?

This is most probably related to the fact that SIMD intrinsics are not declared constexpr. When you generate platform-independent code, GLM does not use intrinsics, and thus its constructors can be declared constexpr. However, as soon as you target a specific platform, e.g. with SSE/AVX, the constexpr must be stripped away in order to benefit from the SIMD functions.
Additional info available at Constexpr and SSE intrinsics

Related

ARM-SVE: wrapping runtime sized register

In a generic SIMD library, eve, we were looking into supporting length-agnostic SVE.
However, we cannot wrap a sizeless register into a struct to do some meta-programming around it.
struct foo {
    svint8_t a;
};
Is there a way to do it? Either clang or gcc.
I found some talk of __sizeless_struct and some patches flying around but I think it didn't go anywhere.
I also found these gcc tests - no wrapping of a register in a struct.
No, unfortunately this isn't possible (at the time of writing). __sizeless_struct was an experimental feature that Arm added as part of the initial downstream implementation of the SVE ACLE in Clang. The main purpose was to allow tuple types like svfloat32x3_t to be defined directly in <arm_sve.h>. But the feature had complex, counter-trend semantics. It broke one of the fundamental rules of C++, which is that all class objects have a constant size, so it would have been an ongoing maintenance burden for upstream compilers.
__sizeless_struct (or something like it) probably wouldn't be acceptable for a portable SIMD framework, since the sizeless struct would inherit all of the restrictions of sizeless vector types: no global variables, no uses in normal structs, etc. Either all SIMD targets would have to live by those restrictions, or the restrictions would vary by target (limiting portability).
Function-based abstraction might be a better starting point than class-based abstraction for SIMD frameworks that want to support variable-length vectors. Google Highway is an example of this and it works well for SVE.

How do OpenMP macros work behind the scenes, in collaboration with the preprocessor/compiler and the library itself?

I'm trying to implement a similar functionality to one of my projects and I was wondering how it works.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective? I'm referencing the compiler since I have read that #pragma directives give side information to the compiler.
If I take into account that all the macros are handled by the preprocessor it gets really confusing to me.
How is the macro expanded, and how does the OpenMP library get access to the information in those macros? Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler it supports, or is it just simple macro invocation?
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
I got this example from here.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective?
This is strongly dependent on the compiler implementation. In practice, for Clang and GCC (and probably ICC), the pragma annotation gives information to the compiler, enabling it to transform the code in a front-end pass. Put simply, the front-end of a compiler does preprocessing, tokenization, syntactic analysis and semantic analysis, while the back-end does optimizations and code generation.
For most steps, mainstream compilers let you get the temporary intermediate output. For example, Clang and GCC have the -E flag for the preprocessor and -S for code generation. Low-level intermediate representations (IR) are more dependent on the compiler implementation, so the flags are not the same (nor are the optimizations and the intermediate language). GCC uses the GENERIC/GIMPLE language for its high-level IR, while Clang uses the LLVM IR language. AFAIK, GIMPLE code can be dumped using the -fdump-* flags; for Clang, -emit-llvm can be used to dump the IR.
In Clang, the transformation is done after the AST generation, but before the first IR generation. Note that some other compilers do an AST transformation, while others do it in later steps. When OpenMP is enabled (with -fopenmp), Clang replaces the pragma region with a call to __kmpc_fork_call and generates a function for the region, which is passed to that call. KMP is the prefix of the IOMP runtime shared by both Clang and ICC; GCC has its own runtime called GOMP. There are many other runtimes, but the mainstream ones are GOMP and IOMP. Note that GCC uses a similar strategy, calling GOMP_parallel with a generated function provided at runtime. The IOMP/GOMP runtimes take care of initializing the region and the ICVs before calling the compiler-generated function.
Note that the preprocessor is not aware of the use of OpenMP (at least not in any OpenMP implementation I am aware of).
How is the macro expanded, and how does the OpenMP library get access to the information in those macros?
Note that pragma annotations are not macros; they are more powerful than that: they provide information to the compiler that can perform non-trivial changes during any compilation step. For example, a pragma can change the way code generation is performed, which is impossible with preprocessor macros (e.g., #pragma GCC unroll n for loop unrolling in GCC, or #pragma ivdep for telling ICC that there are no loop-carried dependencies, enabling auto-vectorization).
The information is passed to the main runtime fork function (i.e., __kmpc_fork_call or GOMP_parallel) as arguments, along with the compiler-generated function for the region.
Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler it supports, or is it just simple macro invocation?
It is not just simple macro invocation, and AFAIK there is no external module for GCC and Clang: OpenMP support is directly integrated into the compilers (though it may be modular, especially in Clang). This is important because compilers need to analyse the pragma annotations at compile time. The pragmas are not just a way to automatically generate runtime calls and abstract them behind a standard language/interface; they also affect compiler passes. For example, #pragma omp simd should affect the auto-vectorization passes of compilers (back-end steps).
AFAIK, there are some (research) OpenMP implementations based on source-to-source compilation so as to be compiler-independent, but I am not sure they support all OpenMP features (especially the SIMD ones).

Is batching same functions with SIMD instruction possible?

I have a scenario where many instances of the exact same function (for simplicity let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively I would just use multi-threading, treating each instance of the function as a thread to exploit the parallelism; they do not contend for the same resources, but they do many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic, so that users do not have to modify their code.
The reason? Because every thread/process/container/VM occupies resources, but AVX only needs one instruction. So I can serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside the function, for example, to accelerate the stream data processing, or deal with some large calculation. None of them mentions batching different instances of same function.
I know there are some challenges, such as different execution path caused by different input, and it is not easy to turn a normal function into a batched version automatically, but I think it is indeed possible technically.
Here are my questions
Is it hard (or even possible) to automatically change a normal function into a batched version?
If 1 is possible, what restrictions should I put on the function to make it work? For example, what if the function only has one path regardless of the data?
Are there other technologies that better solve the problem? I don't think a GPU is a good option for me because GPUs cannot support I/O or branch instructions, although their SIMT model fits my goal perfectly.
Thanks!
SSE/AVX is basically a vector unit: it allows simple operations (like +-*/, and, or, XOR, etc.) on arrays of multiple elements at once. AVX1 and 2 have 256-bit registers, so you can do e.g. 8 32-bit singles at once, or 4 doubles. AVX-512 is coming but quite rare atm.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex things (like mismatched vector widths), or doing it in assembler, are a challenge though.
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Auto-vectorizing compilers sometimes can do this, but it's rare, limited, and extremely complex.
There's two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
int somearray[ABIGNUMBER], someotherarray[ABIGNUMBER], athirdarray[ABIGNUMBER];

inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

GCC ARM SIMD intrinsics compiling to scalar instructions

I have a music synthesis app that runs on a RPi3 (Cortex-A53) in 32-bit mode, under a Yocto-based RTLinux. I'm using GCC 6.3 to compile the code, which uses tons of SIMD intrinsics in C++ to operate on float32x4_t and int32x4_t data. The code is instrumented so that I can see how long it takes to execute certain sizeable chunks of SIMD. It worked well until a couple of days ago, when all of a sudden, after fiddling with unrelated stuff, it slowed down by a factor of more than two.
I went in and looked at the code that was being generated. In the past, the code looked beautiful, very efficient. Now, it's not even using SIMD in most places. I checked the compiler options. They include -marm -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3. Occasionally you see a q register in the generated code, so it knows they exist, but mostly it operates on s registers. Furthermore, it uses lots of code to move pieces of q8-q15 (a.k.a. d16-d31) into general registers and then back into s0-s31 registers to operate on them, and then moves them back, which is horribly inefficient. Does anyone know any reason why the compiler should suddenly start compiling the float32x4_t and int32x4_t vector intrinsics into individual scalar ops? Or any way to diagnose this by getting the compiler to cough up some information about what's going on inside?
Edit: I found that in some places I was doing direct arithmetic on int32x4_t and float32x4_t types, while in other places I was using the ARM intrinsic functions. In the latter case, I was getting SIMD instructions but in the former it was using scalars. When I rewrote the code using all intrinsics, the SIMD instructions reappeared, and the execution time dropped close to what it was before. But I noticed that if I wrote something like x += y * z; the compiler would use scalars but was smart enough to use four VFMA instructions, while if I wrote x = vaddq_f32(x, vmulq_f32(y, z)); it would use VADDQ and VMULQ instructions. This explains why it isn't quite as fast as before when it was compiling arithmetic operators into SIMD.
So the question is now: Why was the compiler willing to compile direct arithmetic on int32x4_t and float32x4_t values into quad SIMD operations before, but not any more? Is there some obscure option that I didn't realize I had in there, and am now missing?

gcc, simd intrinsics and fast-math concepts

Hi all :)
I'm trying to get the hang of a few concepts regarding floating point, SIMD/math intrinsics and the fast-math flag for gcc. More specifically, I'm using MinGW with gcc v4.5.0 on an x86 CPU.
I've searched around for a while now, and that's what I (think I) understand at the moment:
When I compile with no flags, any FP code will be standard x87, no SIMD intrinsics, and the math.h functions will be linked from msvcrt.dll.
When I use -mfpmath, -msse and/or -march so that MMX/SSE/AVX code gets enabled, gcc actually uses SIMD instructions only if I also specify some optimization flags, like -On or -ftree-vectorize. In that case the intrinsics are chosen automagically by gcc, and some math functions (I'm still talking about the standard math funcs in math.h) will become intrinsics or be optimized into inline code, while some others will still come from msvcrt.dll.
If I don't specify optimization flags, does any of this change?
When I use specific SIMD data types (those available as gcc extensions, like v4si or v8qi), I have the option to call intrinsic funcs directly, or again leave the automagic decision to gcc. Gcc can still choose standard x87 code if I don't enable SIMD instructions via the proper flags.
Again, if I don't specify optimization flags, does any of this change?
Please correct me if any of my statements is wrong :p
Now the questions:
Do I ever have to include x86intrin.h to use intrinsics?
Do I ever have to link libm?
What does fast-math have to do with anything? I understand it relaxes the IEEE standard, but, specifically, how? Are other standard functions used? Is some other lib linked? Or are just a couple of flags set somewhere, so that the standard lib behaves differently?
Thanks to anybody who is going to help :D
Ok, I'm answering for anyone who is struggling a bit to grasp these concepts like me.
Optimizations with -Ox work on any kind of code, FPU or SSE.
fast-math seems to work only on x87 code. Also, it doesn't seem to change the FPU control word o_O
Builtins are always included. This behavior can be avoided for some builtins, with some flags, like the strict-standard or -fno-builtin ones.
libm.a is used for some stuff that is not included in glibc, but with MinGW it's just a dummy file, so at the moment it's useless to link against it.
Using gcc's special vector types seems useful only when calling the intrinsics directly; otherwise the code gets vectorized anyway.
Any correction is welcomed :)
Useful links:
fpu / sse control
gcc math
and the gcc manual on "Vector Extensions", "X86 Built-in functions" and "Other Builtins"