How do OpenMP macros work behind the scenes, in collaboration with the preprocessor/compiler and the library itself? - compilation

I'm trying to implement similar functionality in one of my projects and I was wondering how it works.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective. I mention the compiler because I have read that #pragma directives are there to give side information to the compiler.
If I take into account that all macros are handled by the preprocessor, it gets really confusing to me.
How is the macro expanded, and how does the OpenMP library get access to the information in those macros? Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler that it supports, or is it just simple macro invocation?
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
I got this example from here.

For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective?
This is strongly dependent on the compiler implementation. In practice, for Clang and GCC (and probably ICC), the pragma annotation gives information to compiler passes, enabling them to transform the code in a front-end pass. Put simply, the front-end of a compiler does preprocessing, tokenization, syntactic analysis and semantic analysis, while the back-end does optimizations and code generation.
For most steps, mainstream compilers let you dump the intermediate output. For example, Clang and GCC have the -E flag for the preprocessor and -S for stopping after assembly generation. The low-level intermediate representation (IR) is more dependent on the compiler implementation, so the flags are not the same (nor are the optimizations and the intermediate language). GCC uses a GENERIC/GIMPLE language for the high-level IR, while Clang uses the LLVM IR language. AFAIK, the GIMPLE code can be dumped using the -fdump-* flags. For Clang, -emit-llvm can be used to dump the IR code.
In Clang, the transformation is done after the AST generation, but before the first IR generation. Note that some other compilers do an AST transformation, while others do it in later steps. When OpenMP is enabled (with -fopenmp), Clang replaces the pragma region with a call to __kmpc_fork_call and generates a separate function for the region (outlining), which is passed to that KMP function. KMP is the prefix of the IOMP runtime shared by both Clang and ICC. GCC has its own runtime called GOMP. There are many other runtimes, but the mainstream ones are GOMP and IOMP. GCC uses a similar strategy, calling GOMP_parallel with a compiler-generated function for the region. The IOMP/GOMP runtimes take care of initializing the region and the ICVs before calling the compiler-generated function.
Note that the preprocessor is not aware of the use of OpenMP (at least not in any OpenMP implementation I am aware of).
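As a rough, runnable mental model of this outlining, here is a sketch of my own: my_fork_call, outlined_region and NUM_THREADS are invented stand-ins for illustration, not the real __kmpc_fork_call/GOMP_parallel entry points (whose actual signatures differ and also carry the clause information).

/* Sketch only: my_fork_call mimics what an OpenMP runtime fork routine does,
   and outlined_region is what the compiler would conceptually generate from
   the body of "#pragma omp parallel private(iam, np)". */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* The parallel region, outlined into a plain function. iam and np are
   locals here, which is exactly what private(iam, np) means. */
static void *outlined_region(void *arg)
{
    int iam = *(int *)arg;   /* thread id, supplied by the "runtime" */
    int np  = NUM_THREADS;   /* team size, supplied by the "runtime" */
    printf("Hello from thread %d out of %d\n", iam, np);
    return NULL;
}

/* Stand-in for the runtime fork call: spawn a team, run the outlined
   function in every thread, then join (the implicit barrier). */
static void my_fork_call(void *(*fn)(void *))
{
    pthread_t team[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&team[i], NULL, fn, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(team[i], NULL);
}

int main(void)
{
    /* Roughly what "#pragma omp parallel { ... }" is lowered into. */
    my_fork_call(outlined_region);
    return 0;
}

Compile with something like cc -pthread sketch.c. The real IOMP/GOMP runtimes additionally manage thread pools and ICVs, and they receive the clause information (shared variables, number of threads, etc.) as extra arguments alongside the function pointer.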
How is the macro expanded, and how does the OpenMP library get access to the information in those macros?
Note that pragma annotations are not macros; they are more powerful than that: they provide information to the compiler that can drive non-trivial changes during any compilation step. For example, a pragma can change the way code generation is performed, which is impossible with preprocessor macros (e.g. #pragma GCC unroll n for loop unrolling in GCC, or #pragma ivdep for telling ICC that there are no loop-carried dependencies, enabling auto-vectorization).
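As a small illustration of a pragma that affects code generation rather than text expansion (a sketch of my own; the function name and unroll factor are arbitrary, and #pragma GCC unroll requires GCC 8 or newer):

/* The preprocessor leaves the #pragma line untouched; the compiler records
   it and unrolls the following loop by (up to) a factor of 4. */
void scale(float *a, int n)
{
    #pragma GCC unroll 4
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}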
As for OpenMP itself, the information is passed to the main runtime fork function (i.e. __kmpc_fork_call or GOMP_parallel) as arguments, along with the compiler-generated function for the region.
Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler that it supports, or is it just simple macro invocation?
It is not just simple macro invocation, and AFAIK there is no external module for GCC and Clang: the OpenMP support is directly integrated into the compiler (though it may be modular, especially in Clang). This is important because compilers need to analyse the pragma annotations at compile time. The pragmas are not just a way to automatically generate runtime calls and abstract them behind a standard language/interface; they also impact the compiler's own passes. For example, #pragma omp simd should impact the auto-vectorization optimization steps of compilers (back-end steps).
AFAIK, there are some (research) OpenMP implementations based on source-to-source compilation so as to be compiler-independent, but I am not sure they support all OpenMP features (especially the SIMD ones).

Related

Is batching the same functions with SIMD instructions possible?

I have a scenario in which many instances of the exact same function (for simplicity, let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively, I would just use multi-threading and treat each instance of the function as a thread to exploit the parallelism; they do not contend for the same resources, but they do perform many branch operations (e.g. for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g. AVX-512. Of course, this should be automatic so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but AVX only needs one instruction. So I could serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside a function, for example to accelerate stream data processing or to handle some large calculation. None of them mentions batching different instances of the same function.
I know there are some challenges, such as divergent execution paths caused by different inputs, and that it is not easy to turn a normal function into a batched version automatically, but I think it is technically possible.
Here are my questions:
1. Is it hard (or possible) to automatically change a normal function into a batched version?
2. If the answer to 1 is no, what restrictions should I put on the function to make it possible? For example, what if the function only has one path regardless of the data?
3. Are there other technologies that better solve this problem? I don't think a GPU is a good option for me because a GPU cannot support I/O or branch instructions, although its SIMT model fits my goal perfectly.
Thanks!
SSE/AVX is basically a vector unit: it allows simple operations (like +, -, *, /, and, or, XOR, etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. 8 32-bit singles at once, or 4 doubles. AVX-512 is coming but is quite rare at the moment.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex things (like mismatched vector widths), or doing it in assembler, are a challenge though.
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Auto-vectorizing compilers can sometimes do this, but it is rare and limited, and extremely complex.
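To make the "arrays of basic types" case concrete, here is a minimal sketch of my own using AVX intrinsics (it assumes an AVX-capable x86 CPU and compilation with something like -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats into one register  */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* 8 additions with one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}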
There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want, provided that the function is inlined and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

best way to mix OpenACC with OpenMP

I have a code that is mainly developed using OpenACC. I would like to compare OpenACC results on P100 GPUs against OpenMP results on Intel's KNL nodes. I tried the compiler flag -ta=multicore, but it basically serialized all the loops (as per the accelerator info output). Is the only way to use a preprocessor conditional for all the loops, as below? Are there any other more efficient or cleaner ways?
#ifndef _OPENACC
#pragma omp .....
#else
#pragma acc ......
#endif

GCC ARM SIMD intrinsics compiling to scalar instructions

I have a music synthesis app that runs on an RPi3 (Cortex-A53) in 32-bit mode, under a Yocto-based RTLinux. I'm using GCC 6.3 to compile the code, which uses tons of SIMD intrinsics in C++ to operate on float32x4_t and int32x4_t data. The code is instrumented so that I can see how long it takes to execute certain sizeable chunks of SIMD. It worked well until a couple of days ago, when, all of a sudden, after fiddling with unrelated stuff, it slowed down by a factor of more than two.
I went in and looked at the code that was being generated. In the past, the code looked beautiful, very efficient. Now, it's not even using SIMD in most places. I checked the compiler options. They include -marm -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3. Occasionally you see a q register in the generated code, so it knows they exist, but mostly it operates on s registers. Furthermore, it uses lots of code to move pieces of q8-q15 (a.k.a. d16-d31) into general registers and then back into s0-s31 registers to operate on them, and then moves them back, which is horribly inefficient. Does anyone know any reason why the compiler should suddenly start compiling the float32x4_t and int32x4_t vector intrinsics into individual scalar ops? Or any way to diagnose this by getting the compiler to cough up some information about what's going on inside?
Edit: I found that in some places I was doing direct arithmetic on int32x4_t and float32x4_t types, while in other places I was using the ARM intrinsic functions. In the latter case I was getting SIMD instructions, but in the former it was using scalars. When I rewrote the code using intrinsics everywhere, the SIMD instructions reappeared, and the execution time dropped close to what it was before. But I noticed that if I wrote something like x += y * z; the compiler would use scalars, though it was at least smart enough to use four scalar VFMA instructions, while if I wrote x = vaddq_f32(x, vmulq_f32(y, z)); it would use VADDQ and VMULQ instructions. This explains why it isn't quite as fast as before, when it was compiling the arithmetic operators into SIMD.
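For reference, the two styles being compared look like this (a sketch of my own; it needs an ARM target with NEON, e.g. the -mfpu flags above, and whether the operator form stays vectorized is exactly the compiler-dependent behavior described here):

#include <arm_neon.h>

/* Operator form: arithmetic directly on the vector types (GCC vector
   extension semantics). This is the code that started being scalarized. */
float32x4_t mac_operators(float32x4_t x, float32x4_t y, float32x4_t z)
{
    x += y * z;
    return x;
}

/* Intrinsic form: explicit NEON intrinsics, which kept producing
   VMULQ/VADDQ vector instructions. */
float32x4_t mac_intrinsics(float32x4_t x, float32x4_t y, float32x4_t z)
{
    return vaddq_f32(x, vmulq_f32(y, z));
}

Compiling both with -S and comparing the generated assembly is a quick way to see which form gets scalarized by a given compiler version.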
So the question is now: Why was the compiler willing to compile direct arithmetic on int32x4_t and float32x4_t values into quad SIMD operations before, but not any more? Is there some obscure option that I didn't realize I had in there, and am now missing?

How is GCC IR different from LLVM IR?

Why do people prefer LLVM IR, and how exactly is it different from the GCC IR? Is target dependency a factor here?
I'm a complete newbie to compilers, and wasn't able to find anything relevant even after many hours of searching for an answer. Any insights would be helpful.
Firstly, as this answer touches on complex and sensitive topics, I want to make a few disclaimers:
I assume your question is about the middle-end IRs of LLVM and GCC (as the term "LLVM IR" applies only to the middle-end). Discussing the differences between the back-end IRs (LLVM MachineIR and GCC RTL) and the related codegen tools (LLVM TableGen and GCC Machine Description) is an interesting and important topic, but it would make the answer several times larger.
I leave out the library-based design of LLVM vs. the monolithic design of GCC, as this is separate from the IR per se (although related).
I enjoy hacking on both GCC and LLVM and I do not put one ahead of the other. LLVM is what it is because people could learn from the things that GCC got wrong back in the 2000s (and which have been significantly improved since then).
I'm happy to improve this answer so please post comments if you think that something is imprecise or missing.
The most important fact is that LLVM IR and GCC IR (called GIMPLE) are not that different at their core: both are standard control-flow graphs of basic blocks, each block being a linear sequence of two-input, one-output instructions (so-called "three-address code") which have been converted to SSA form. Most production compilers have been using this design since the 1990s.
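For concreteness, here is what "three-address code in SSA form" looks like for a tiny function (the pseudo-IR in the comment is illustrative only, not literal GIMPLE or LLVM syntax):

/* Source: */
int f(int a, int b)
{
    return (a + b) * a;
}

/* Pseudo-IR, as a three-address-code basic block in SSA form:
     t1 = a + b    ; at most two inputs and one output per instruction
     t2 = t1 * a   ; every value (t1, t2) is assigned exactly once (SSA)
     ret t2
*/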
The main advantages of LLVM IR are that it is less tightly bound to the compiler implementation, more formally defined, and has a nicer C++ API. This allows for easier processing, transformation and analysis, which makes it the IR of choice these days, both for compilers and for other related tools.
I expand on the benefits of LLVM IR in the subsections below.
Standalone IR
LLVM IR was originally designed to be fully reusable across arbitrary tools besides the compiler itself. The original intent was to use it for multi-stage optimization: the IR would be successively optimized by the ahead-of-time compiler, the link-time optimizer and the JIT compiler at runtime. This didn't work out, but reusability had other important implications, most notably that it allowed easy integration of other types of tools (static analyzers, instrumenters, etc.).
The GCC community never had the desire to enable any tools besides the compiler (Richard Stallman resisted attempts to make the IR more reusable, to prevent third-party commercial tools from reusing GCC's frontends). Thus GIMPLE (GCC's IR) was never considered to be more than an implementation detail; in particular, it does not provide a full description of the compiled program (e.g. it lacks the program's call graph, type definitions, stack offsets and alias information).
Flexible pipeline
The idea of reusability and of making the IR a standalone entity led to an important design consequence in LLVM: compilation passes can be run in any order, which prevents complex inter-pass dependencies (all dependencies have to be made explicit via analysis passes) and enables easier experimentation with the compilation pipeline, e.g.
running strict IR verification checks after each pass
bisecting the pipeline to find a minimal subset of passes which causes a compiler crash
fuzzing the order of passes
Better unit-testing support
A standalone IR allows LLVM to use IR-level unit tests, which allow easy testing of optimization/analysis corner cases. This is much harder to achieve through C/C++ snippets (as in the GCC testsuite), and even when you manage, the generated IR will most likely change significantly in future versions of the compiler, so the corner case that your test was intended for will no longer be covered.
Simple link-time optimization
A standalone IR enables easy combination of IR from separate translation units with a follow-up (whole-program) optimization. This is not a complete replacement for link-time optimization (as it does not deal with the scalability issues which arise in production software), but it is often good enough for smaller programs (e.g. in embedded development or research projects).
Stricter IR definition
Although criticized by academia, LLVM IR has much stricter semantics compared to GIMPLE. This simplifies the implementation of various static analyzers, e.g. the IR Verifier.
No intermediate IRs
LLVM IR is generated directly by the frontend (Clang, llgo, etc.) and preserved throughout the whole middle-end. This means that all tools, optimizations and internal APIs only need to operate on a single IR. The same is not true for GCC: even GIMPLE has three distinct variants:
high GIMPLE (includes lexical scopes, high-level control-flow constructs, etc.)
pre-SSA low GIMPLE
final SSA GIMPLE
and also, GCC frontends typically generate the intermediate GENERIC IR instead of GIMPLE.
Simpler IR
Compared to GIMPLE, LLVM IR was deliberately made simpler by reducing the number of cases which IR consumers need to consider. I've added several examples below.
Explicit control-flow
All basic blocks in an LLVM IR program have to end with an explicit control-flow opcode (branch, return, etc.). Implicit control flow (i.e. fallthrough) is not allowed.
Explicit stack allocations
In LLVM IR, virtual registers do not have memory. Stack allocations are represented by dedicated alloca operations. This simplifies working with stack variables; e.g. an equivalent of GCC's ADDR_EXPR is not needed.
Explicit indexing operations
Contrary to GIMPLE, which has a plethora of opcodes for memory references (INDIRECT_REF, MEM_REF, ARRAY_REF, COMPONENT_REF, etc.), LLVM IR has only plain load and store opcodes, and all complex address arithmetic is moved to a dedicated structured indexing opcode, getelementptr.
Garbage collection support
LLVM IR provides dedicated pseudo-instructions for garbage-collected languages.
Higher-level implementation language
While C++ may not be the best programming language, it definitely allows writing much simpler (and in many cases more functional) systems code, especially with the post-C++11 changes (LLVM aggressively adopts new standards). Following LLVM, GCC has also adopted C++, but the majority of the codebase is still written in C style.
There are too many instances where C++ enables simpler code, so I'll just name a few.
Explicit hierarchy
The hierarchy of operators in LLVM is implemented via standard inheritance and template-based custom RTTI. GCC, on the other hand, achieves the same via old-style inheritance-via-aggregation:
// Base class which all operators aggregate
struct GTY(()) tree_base {
  ENUM_BITFIELD(tree_code) code : 16;
  unsigned side_effects_flag : 1;
  unsigned constant_flag : 1;
  unsigned addressable_flag : 1;
  ... // Many more fields
};

// Typed operators add type to base data
struct GTY(()) tree_typed {
  struct tree_base base;
  tree type;
};

// Constants add integer value to typed node data
struct GTY(()) tree_int_cst {
  struct tree_typed typed;
  HOST_WIDE_INT val[1];
};

// Complex numbers add real and imaginary components to typed data
struct GTY(()) tree_complex {
  struct tree_typed typed;
  tree real;
  tree imag;
};

// Many more operators follow
...
and tagged union paradigms:
union GTY ((ptr_alias (union lang_tree_node),
            desc ("tree_node_structure (&%h)"), variable_size)) tree_node {
  struct tree_base GTY ((tag ("TS_BASE"))) base;
  struct tree_typed GTY ((tag ("TS_TYPED"))) typed;
  struct tree_int_cst GTY ((tag ("TS_INT_CST"))) int_cst;
  struct tree_complex GTY ((tag ("TS_COMPLEX"))) complex;
  // Many more members follow
  ...
};
All GCC operator APIs use the base tree type, which is accessed via a fat macro interface (DECL_NAME, TREE_IMAGPART, etc.). The interface is only verified at runtime (and only if GCC was configured with --enable-checking) and does not allow static checking.
More concise APIs
LLVM generally provides simpler APIs for pattern-matching IR in optimizers. For example, checking that an instruction is an addition with a constant operand looks like this in GCC:
if (gimple_assign_p (stmt)
    && gimple_assign_rhs_code (stmt) == PLUS_EXPR
    && TREE_CODE (gimple_assign_rhs2 (stmt)) == INTEGER_CST)
  {
    ...
and in LLVM:
if (auto BO = dyn_cast<BinaryOperator>(V))
  if (BO->getOpcode() == Instruction::Add
      && isa<ConstantInt>(BO->getOperand(1)))
  {
    ...
Arbitrary-precision arithmetic
Due to C++ support for operator overloading, LLVM can use arbitrary-precision integers for all computations, whereas GCC still uses physical integers (the HOST_WIDE_INT type, which is 32-bit on 32-bit hosts):
if (!tree_fits_shwi_p (arg1))
  return false;
*exponent = tree_to_shwi (arg1);
As shown in the example, this can lead to missed optimizations (an early bail-out whenever the value does not fit into HOST_WIDE_INT).
GCC got an equivalent of APInt a few years ago, but the majority of the codebase still uses HOST_WIDE_INT.

Intel's pragma simd vs OpenMP's pragma omp simd

The Intel compiler allows us to vectorize loops via
#pragma simd
for ( ... )
However, you also have the option to do this with OpenMP 4's directive:
#pragma omp simd
for ( ... )
Is there any difference between the two?
For all intents and purposes they should be identical. The difference is that the OpenMP 4.0 #pragma omp simd directive is portable and should work with other compilers that support OpenMP 4.0 as well as Intel's.
Furthermore, there are several clauses in the OpenMP version which allow you to control the vectorization in a more robust manner (safelen(), linear(), aligned(), reduction(), and collapse() come to mind).
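For illustration, here is a sketch of my own using a few of those clauses (it assumes a compiler with OpenMP 4.x support, and the 64-byte alignment of the arrays is a promise the caller must keep):

#include <stddef.h>

/* Vectorized dot product: reduction(+:sum) makes the accumulation safe to
   vectorize, safelen(16) states that up to 16 iterations may be executed
   concurrently, and aligned(a, b : 64) promises 64-byte alignment. */
float dot(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum) safelen(16) aligned(a, b : 64)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

Compile with -fopenmp (or with -fopenmp-simd in GCC/Clang to enable only the simd directives without the threading runtime).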
#pragma simd
- is positioned as part of the Intel C++ Cilk SIMD extensions (along with Array Notation). This positioning looks slightly strange, because Cilk is not available for Fortran, while the Intel Fortran compiler supports a mostly identical simd directive.
#pragma omp simd
- is part of the OpenMP standard, so it's naturally more portable across compilers and platforms.
Normally Intel rolls out new capabilities in Cilk first and in OpenMP second. The reason is very natural: it takes time to get something accepted into the OpenMP standard. For example, simdlen was only added in OpenMP 4.5 in November 2015, while the identical clause (vectorlength) had already been part of the Cilk pragma for 3 or 4 years. The simdlen vs. vectorlength case also highlights that the syntax of some clauses may differ between simd and omp simd.
So, if you need portability across compilers, use the OpenMP pragma. But if it's more important for you to get access to new simd compiler capabilities as early as possible, then you may prefer the (proprietary) Cilk pragma. All the same arguments and considerations apply equally to #pragma omp declare simd vs. #pragma declare simd (which should answer your potential second question).
A reference for the "Cilk" pragma simd clauses is available here: https://software.intel.com/en-us/node/524555
(I think it's slightly out of date; I've heard about new capabilities of pragma simd not yet reflected at that link.)

Resources