How is GCC IR different from LLVM IR?

Why do people prefer LLVM IR, and how exactly is it different from the GCC IR? Is target dependency a factor here?
I'm a complete newbie to compilers, and wasn't able to find anything relevant even after many hours of searching for an answer. Any insights would be helpful.

Firstly, as this answer touches on complex and sensitive topics, I want to make a few disclaimers:
I assume your question is about the middle-end IRs of LLVM and GCC (the term "LLVM IR" refers only to the middle-end IR). A discussion of the differences between the back-end IRs (LLVM MachineIR and GCC RTL) and the related codegen tools (LLVM TableGen and GCC Machine Description) is an interesting and important topic but would make this answer several times larger.
I left out the library-based design of LLVM vs. the monolithic design of GCC, as this is separate from the IR per se (although related).
I enjoy hacking on both GCC and LLVM and do not put one ahead of the other. LLVM is what it is because people could learn from the things GCC got wrong back in the 2000s (many of which have been significantly improved since then).
I'm happy to improve this answer so please post comments if you think that something is imprecise or missing.
The most important fact is that LLVM IR and GCC IR (called GIMPLE) are not that different at their core: both are standard control-flow graphs of basic blocks, each block being a linear sequence of two-input, one-output instructions (so-called "three-address code") converted to SSA form. Most production compilers have used this design since the 1990s.
The main advantages of LLVM IR are that it is less tightly bound to the compiler implementation, more formally defined, and has a nicer C++ API. This allows for easier processing, transformation, and analysis, which makes it the IR of choice these days, both for compilers and for other related tools.
I expand on the benefits of LLVM IR in the sections below.
Standalone IR
LLVM IR was originally designed to be fully reusable across arbitrary tools besides the compiler itself. The original intent was to use it for multi-stage optimization: the IR would be successively optimized by the ahead-of-time compiler, the link-time optimizer, and a JIT compiler at runtime. This didn't work out, but the reusability had other important implications; most notably, it allowed easy integration of other kinds of tools (static analyzers, instrumenters, etc.).
The GCC community never had a desire to enable any tools besides the compiler itself (Richard Stallman resisted attempts to make the IR more reusable, to prevent third-party commercial tools from reusing GCC's frontends). Thus GIMPLE (GCC's IR) was never considered to be more than an implementation detail; in particular, it doesn't provide a full description of the compiled program (e.g. it lacks the program's call graph, type definitions, stack offsets, and alias information).
Flexible pipeline
The idea of reusability and of making the IR a standalone entity led to an important design consequence in LLVM: compilation passes can be run in any order, which prevents complex inter-pass dependencies (all dependencies have to be made explicit via analysis passes) and enables easier experimentation with the compilation pipeline (see the sketch after this list), e.g.
running strict IR verification checks after each pass
bisecting the pipeline to find a minimal subset of passes which causes a compiler crash
fuzzing the order of passes
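For illustration, here is a minimal sketch of building an explicitly ordered pipeline with LLVM's new pass manager C++ API. The function name and the particular pass list are mine, and exact headers and pass names may vary between LLVM versions:
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Run a hand-picked sequence of passes over a module. Any ordering is legal
// because inter-pass dependencies are expressed through analyses rather than
// implicit global state.
void runCustomPipeline(Module &M) {
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;

  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  // Textual pipeline description; reorder the passes freely to experiment.
  ModulePassManager MPM;
  if (Error Err = PB.parsePassPipeline(MPM, "function(mem2reg,instcombine,gvn)")) {
    logAllUnhandledErrors(std::move(Err), errs(), "invalid pipeline: ");
    return;
  }
  MPM.run(M, MAM);
}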
Better unit-testing support
The standalone IR allows LLVM to use IR-level unit tests, which makes it easy to test optimization/analysis corner cases. This is much harder to achieve through C/C++ snippets (as in the GCC testsuite), and even when you manage, the generated IR will most likely change significantly in future versions of the compiler and the corner case your test was intended for will no longer be covered.
Simple link-time optimization
The standalone IR enables easy combination of IR from separate translation units with a follow-up (whole-program) optimization. This is not a complete replacement for full link-time optimization (it does not deal with the scalability issues which arise in production software) but is often good enough for smaller programs (e.g. in embedded development or research projects).
Stricter IR definition
Although criticized by academia, LLVM IR has much stricter semantics than GIMPLE. This simplifies the implementation of various static checkers, e.g. the IR Verifier (a minimal example of invoking it from C++ follows).
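A small sketch of running the verifier over a module in-process; the function name is mine, the verifier API itself lives in llvm/IR/Verifier.h:
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Returns true if the module violates the IR rules; diagnostics go to stderr.
bool isBroken(const Module &M) {
  return verifyModule(M, &errs());
}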
No intermediate IRs
LLVM IR is generated directly by the frontend (Clang, llgo, etc.) and preserved throughout the whole middle-end. This means that all tools, optimizations, and internal APIs only need to operate on a single IR. The same is not true for GCC: even GIMPLE itself has three distinct variants:
high GIMPLE (includes lexical scopes, high-level control-flow constructs, etc.)
pre-SSA low GIMPLE
final SSA GIMPLE
and in addition, GCC frontends typically generate the intermediate GENERIC IR rather than GIMPLE directly.
Simpler IR
Compared to GIMPLE, LLVM IR was deliberately made simpler by reducing the number of cases which IR consumers need to consider. Several examples follow.
Explicit control-flow
All basic blocks in an LLVM IR program have to end with an explicit control-flow instruction (branch, return, etc.). Implicit control flow (i.e. fallthrough) is not allowed.
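A small sketch of what this means when building IR with the C++ IRBuilder (the helper name is mine): even a plain fallthrough has to be materialized as a terminator.
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// "Fallthrough" from one block to the next is spelled out as an
// unconditional branch terminator.
void fallThrough(BasicBlock *From, BasicBlock *To) {
  IRBuilder<> B(From);  // insertion point: end of From
  B.CreateBr(To);
}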
Explicit stack allocations
In LLVM IR, virtual registers do not have backing memory; stack allocations are represented by dedicated alloca instructions. This simplifies working with stack variables, e.g. an equivalent of GCC's ADDR_EXPR is not needed.
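For example, a local variable lowered through the IRBuilder C++ API looks roughly like this (a sketch; the helper name is mine):
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// The stack slot is an explicit alloca instruction; "taking its address"
// is just using the alloca's value, so no ADDR_EXPR-like node is needed.
Value *emitLocalInt(IRBuilder<> &B) {
  AllocaInst *Slot = B.CreateAlloca(B.getInt32Ty(), nullptr, "x");
  B.CreateStore(B.getInt32(42), Slot);
  return B.CreateLoad(B.getInt32Ty(), Slot, "x.val");
}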
Explicit indexing operations
In contrast to GIMPLE, which has a plethora of opcodes for memory references (INDIRECT_REF, MEM_REF, ARRAY_REF, COMPONENT_REF, etc.), LLVM IR has only plain load and store opcodes, and all the address arithmetic is moved to a dedicated structured indexing instruction, getelementptr.
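A sketch of computing a field address with a single structured getelementptr via the C++ API (the helper name is mine):
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// One getelementptr covers what GCC spreads over COMPONENT_REF, ARRAY_REF,
// MEM_REF, etc.: here, the address of field #1 of the struct at BasePtr.
Value *fieldAddr(IRBuilder<> &B, StructType *STy, Value *BasePtr) {
  return B.CreateStructGEP(STy, BasePtr, 1, "field.addr");
}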
Garbage collection support
LLVM IR provides dedicated pseudo-instructions for garbage-collected languages.
Higher-level implementation language
While C++ may not be the best programming language, it definitely allows writing much simpler (and in many cases more functional) systems code, especially with post-C++11 changes (LLVM aggressively adopts new standards). Following LLVM, GCC has also switched to C++, but the majority of the codebase is still written in C style.
There are too many instances where C++ enables simpler code, so I'll name just a few.
Explicit hierarchy
The hierarchy of operators in LLVM is implemented via standard inheritance and template-based custom RTTI. GCC, on the other hand, achieves the same via old-style inheritance-via-aggregation
// Base class which all operators aggregate
struct GTY(()) tree_base {
  ENUM_BITFIELD(tree_code) code : 16;
  unsigned side_effects_flag : 1;
  unsigned constant_flag : 1;
  unsigned addressable_flag : 1;
  ... // Many more fields
};

// Typed operators add type to base data
struct GTY(()) tree_typed {
  struct tree_base base;
  tree type;
};

// Constants add integer value to typed node data
struct GTY(()) tree_int_cst {
  struct tree_typed typed;
  HOST_WIDE_INT val[1];
};

// Complex numbers add real and imaginary components to typed data
struct GTY(()) tree_complex {
  struct tree_typed typed;
  tree real;
  tree imag;
};

// Many more operators follow
...
and tagged union paradigms:
union GTY ((ptr_alias (union lang_tree_node),
            desc ("tree_node_structure (&%h)"), variable_size)) tree_node {
  struct tree_base GTY ((tag ("TS_BASE"))) base;
  struct tree_typed GTY ((tag ("TS_TYPED"))) typed;
  struct tree_int_cst GTY ((tag ("TS_INT_CST"))) int_cst;
  struct tree_complex GTY ((tag ("TS_COMPLEX"))) complex;
  // Many more members follow
  ...
};
All GCC operator APIs use the base tree type, which is accessed via a fat macro interface (DECL_NAME, TREE_IMAGPART, etc.). The interface is only verified at runtime (and only if GCC was configured with --enable-checking) and does not allow static checking.
More concise APIs
LLVM generally provides simpler APIs for pattern matching IR in optimizers. For example, checking that an instruction is an addition with a constant operand looks like this in GCC:
if (gimple_assign_p (stmt)
    && gimple_assign_rhs_code (stmt) == PLUS_EXPR
    && TREE_CODE (gimple_assign_rhs2 (stmt)) == INTEGER_CST)
  {
    ...
and in LLVM:
if (auto *BO = dyn_cast<BinaryOperator>(V))
  if (BO->getOpcode() == Instruction::Add
      && isa<ConstantInt>(BO->getOperand(1))) {
    ...
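LLVM also ships a PatternMatch.h header that makes such checks even shorter; a rough equivalent of the check above as a sketch (the function name is mine):
#include "llvm/IR/Constants.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Value.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// True if V is an integer add whose second operand is a constant.
bool isAddWithConstant(Value *V) {
  ConstantInt *C;
  return match(V, m_Add(m_Value(), m_ConstantInt(C)));
}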
Arbitrary-precision arithmetic
Due to C++ support for operator overloading, LLVM can use arbitrary-precision integers for all computations, whereas GCC still uses physical integers (the HOST_WIDE_INT type, which is 32-bit on 32-bit hosts):
if (!tree_fits_shwi_p (arg1))
  return false;
*exponent = tree_to_shwi (arg1);
As the example shows, this can lead to missed optimizations: the code simply bails out when the value does not fit into a host-sized signed integer.
GCC did get an equivalent of APInt (wide_int) a few years ago, but the majority of the codebase still uses HOST_WIDE_INT.
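For comparison, a sketch of working with LLVM's APInt, which carries its own bit width and so never depends on the host integer size (the function name is mine):
#include "llvm/ADT/APInt.h"
using namespace llvm;

// Fold x << k at an arbitrary bit width; no tree_fits_shwi_p-style bailout
// is needed because the value never has to fit in a host integer.
APInt foldShiftLeft(const APInt &X, unsigned K) {
  return X.shl(K);
}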

Related

ARM-SVE: wrapping runtime sized register

In a generic SIMD library, eve, we were looking into supporting length-agnostic SVE.
However, we cannot wrap a sizeless register into a struct to do some meta-programming around it.
struct foo {
  svint8_t a;
};
Is there a way to do it? Either clang or gcc.
I found some talk of __sizeless_struct and some patches flying around but I think it didn't go anywhere.
I also found these gcc tests - no wrapping of a register in a struct.
No, unfortunately this isn't possible (at the time of writing). __sizeless_struct was an experimental feature that Arm added as part of the initial downstream implementation of the SVE ACLE in Clang. The main purpose was to allow tuple types like svfloat32x3_t to be defined directly in <arm_sve.h>. But the feature had complex, counter-trend semantics. It broke one of the fundamental rules of C++, which is that all class objects have a constant size, so it would have been an ongoing maintenance burden for upstream compilers.
__sizeless_struct (or something like it) probably wouldn't be acceptable for a portable SIMD framework, since the sizeless struct would inherit all of the restrictions of sizeless vector types: no global variables, no uses in normal structs, etc. Either all SIMD targets would have to live by those restrictions, or the restrictions would vary by target (limiting portability).
Function-based abstraction might be a better starting point than class-based abstraction for SIMD frameworks that want to support variable-length vectors. Google Highway is an example of this and it works well for SVE.
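To illustrate the function-based approach, here is a rough sketch using the SVE ACLE intrinsics directly (compile for an SVE-enabled target, e.g. with -march=armv8-a+sve; the wrapper name is mine). The sizeless values only ever appear as parameters, return values, and locals, which is allowed:
#include <arm_sve.h>

// A free-function "wrapper" over the intrinsic instead of a struct member:
// svint8_t never needs to be stored inside a class object.
static inline svint8_t add_epi8(svint8_t a, svint8_t b) {
  return svadd_s8_m(svptrue_b8(), a, b);
}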

ARM softfp vs hardfp performance

I have an ARM-based platform running Linux. Even though its GCC-based toolchain supports both hardfp and softfp, the vendor recommends using softfp, and the platform ships with a set of standard and platform-related libraries that are available only in softfp versions.
I'm writing computation-intensive (NEON) AI code based on OpenCV and TensorFlow Lite. Following the vendor guide, I have built these with the softfp option. However, I have a feeling that my code underperforms compared to otherwise similar hardfp platforms.
Does code performance depend on the softfp/hardfp setting? Do I understand correctly that all the .o and .a files the compiler produces for my program also use the softfp convention, which is less efficient? If so, are there any tricks to use the hardfp calling convention internally but softfp for the external libraries?
Normally, all objects that are linked together need to use the same float ABI. So if you need to use this softfp-only library, I'm afraid you have to compile your own software as softfp too.
I had the same question about mixing ABIs. See here
Regarding performance: the performance lost with softfp compared to hardfp comes from passing (floating-point) function parameters in the general-purpose registers instead of the FPU registers. This requires some additional copies between registers. As old_timer said, it is impossible to evaluate the performance loss in general. If you have a single huge function with many float operations, the performance will be the same. If you have many small function calls with many floating-point arguments and few operations, it will be dramatically slower.
The softfp option only affects the parameter passing.
In other words, unless you are passing lots of float type arguments while calling functions, there won't be any measurable performance hit compared to hardfp.
And since well-designed projects rely heavily on passing pointers to structures instead of many individual values, I would stick with softfp.
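As a concrete illustration of the parameter-passing difference, consider a small function like the sketch below (compiled once with -mfloat-abi=softfp and once with -mfloat-abi=hard); comparing the two disassemblies shows the extra register moves in the softfp version:
// With -mfloat-abi=softfp the two float arguments arrive in r0/r1 and must
// be moved into VFP/NEON registers before the multiply; with
// -mfloat-abi=hard they arrive directly in s0/s1.
float mul_add(float a, float b) {
    return a * b + 1.0f;
}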

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions over here that the VC++ compiler lacks support for vectorization, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers to this 2011 question on Quora, which suggest that compilers will do vectorization where possible.
Maybe compilers will only vectorize very obvious cases such as a std::array<int, 4> and nothing more, in which case C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do better automatic vectorization (perhaps due to language design), such that the C++ standards committee saw a need for explicit (C++17-style) vectorization?
The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also very rightly point out using their compiler also allows you to use pragma simd constructs to further show the compiler assumptions that can or can't be made (that are unclear from a purely syntactic level), and hence allow the compiler to further vectorise the code without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (eg multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
But for more complicated code, the vectorisation gains come not from simple loop unrolling and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to let the compiler see the opportunity to do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this - put your C++ code in there and look at what different compilers actually generate? Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later versions) but will show you what different versions of ICC, GCC, Clang, and others can do with your code.
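For instance, a simple scale-by-a-scalar loop of the kind mentioned above is a good starting point to paste into Compiler Explorer (this sketch is mine); current GCC, Clang, and ICC typically auto-vectorize it at -O3 (e.g. with -O3 -march=haswell):
// Multiply every element by a scalar: simple enough that most compilers
// will emit SIMD instructions without being told to.
void scale(float *__restrict out, const float *__restrict in,
           float k, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * k;
}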

Default GCC optimization options for a specific architecture

Our compilers course features exercises asking us to compare code built with the -O and -O3 gcc options. The code generated on my machine isn't the same as the code in the course. Is there a way to figure out the optimization options used in the course, in order to obtain the same code on my machine and make more meaningful observations?
I found how to get the optimization options on my machine:
$ gcc -O3 -Q --help=optimizers
But is there a way to deduce the options used on my professor's machine, other than trying them all and modifying them one by one (.ident "GCC: (Debian 4.3.2-1.1) 4.3.2")?
Thanks for your attention.
Edit:
I noticed that the code generated on my machine lacks the prologue and epilogue generated on my professor's. Is there an option to force prologue generation (Google doesn't seem to turn up much)?
Here's what you need to know about compiler optimizations: they are architecture-dependent. They also differ from one version of the compiler to another (gcc-4.9 does more by default than gcc-4.4).
By architecture, I mean the CPU micro-architecture (Intel: Nehalem, Sandy Bridge, Ivy Bridge, Haswell, KNC, ...; AMD: Bobcat, Bulldozer, Jaguar, ...). Compilers usually convert input code (C, C++, Ada, ...) into a CPU-agnostic intermediate representation (GIMPLE for GCC) on which a large number of optimizations are performed. After that, the compiler generates a lower-level representation closer to assembly, on which architecture-specific optimizations are applied. Such optimizations include choosing the instructions with the lowest latencies, determining loop unroll factors based on the loop size and the instruction cache size, and so on.
Since your generated code is different from the one you got in class, I suppose the underlying architectures must be different. In this case, even with the same compiler flags you won't be able to get the same assembly code (even with no optimizations you'll get different assembly codes).
Instead, you should concentrate on comparing your own optimized and non-optimized code rather than trying to reproduce exactly what you were given in class. I even think it's a great reverse-engineering exercise to compare your optimized code to the one you were given.
You can find one of my earlier posts about compiler optimizations here.
Two great books on the subject are the Dragon Book (Compilers: Principles, Techniques, and Tools) by Aho, Sethi, and Ullman, and Engineering a Compiler by Keith Cooper and Linda Torczon.

How can a compiler be cross platform(hardware)?

I just realized that compilers convert source code into binaries for the destination platform. Kind of obvious... but if a compiler works that way, how can the same compiler be used for different systems like x86, ARM, MIPS, etc.?
Aren't they supposed to "know" the machine language of the hardware platform in order to build the binary? Does a compiler (like GCC) know the machine language of every single platform it supports?
How is that system possible, and how can a compiler be optimized for that many platforms at the same time?
Yes, they have to "know" the machine language of every single platform they support. This is required to generate machine code. However, compilation is a multi-step process, and the first steps are usually common to most architectures.
Taken from Wikipedia:

Structure of a compiler

Compilers bridge source programs in high-level languages with the underlying hardware. A compiler requires determining the correctness of the syntax of programs, generating correct and efficient object code, run-time organization, and formatting output according to assembler and/or linker conventions. A compiler consists of three main parts: the frontend, the middle-end, and the backend.

The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a useful way. Type checking is also performed by collecting type information. The frontend then generates an intermediate representation or IR of the source code for processing by the middle-end.

The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context. The middle-end generates another IR for the following backend. Most optimization efforts are focused on this part.

The back end is responsible for translating the IR from the middle-end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the program variables where possible. The backend utilizes the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most algorithms for optimization are in NP, heuristic techniques are well-developed.
More in this article, which describes the structure of a compiler, and in this one, which deals with cross compilers.
The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross-hardware compilers emit an "intermediate representation" of the code which is hardware-agnostic; this is then turned into target-specific code by the native toolchain.
Yes, it is possible; this is called a cross compiler. The compiler first generates object code which is not executable on the current machine but targets the destination machine. That object code is then assembled and linked, together with the external libraries of the target machine, into the final binary.
TL;DR: Yes, the compiler knows the target machine language, but it can run on different hardware than it targets.
I recommend reading the attached links for more information.
Every platform has its own toolchain; a toolchain includes gcc, gdb, ld, nm, etc.
Let's take GCC as a specific example. The GCC source code has many layers, including architecture-dependent and architecture-independent parts. The architecture-dependent part contains procedures to handle architecture-specific things like the stack layout, function calls, and floating-point operations. To target a specific architecture such as ARM, we build GCC as a cross compiler for that architecture. You can see the steps here for reference: http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain
This architecture-dependent part is responsible for handling machine-language operations.
