Default GCC optimization options for a specific architecture

Default GCC optimization options for a specific architecture - gcc

our compilers course features exercises asking us to compare code built with the -O and -O3 gcc options. The code generated by my machine isn't the same as the code in the course. Is there a way to figure the optimization options used in the course, in order to obtain the same code on my machine, and make more meaningful observations?
I found how to get the optimization options on my machine :
$ gcc -O3 -Q --help=optimizer
But is there a way to deduce those on the machine of the professor except by trying them all and modifying them one by one (.ident "GCC: (Debian 4.3.2-1.1) 4.3.2")?
Thanks for your attention.
Edit:
I noticed that the code generated on my machine lacks the prologue and epilogue generated on my professor's. Is there an option to force prologue generation (google doesn't seem to bring much)?

Here's what you need to know about compiler optimizations : they are architecture dependent. Also, they're mainly different from one version of the compiler to another (gcc-4.9 does more stuff by default than gcc-4.4).
By architecture, I mean CPU micro architecture (Intel : Nehalem, Sandy bridge, Ivy Bridge, Haswell, KNC ... AMD : Bobcat, Bulldozzer, Jaguar, ...). Compilers usually convert input code (C, C++, ADA, ...) into a CPU-agnostic intermediary representation (GIMPLE for GCC) on which a large number of optimizations will be performed. After that, the compiler will generate a lower level representation closer to assembly. On the latter, architecture specific optimizations will be unrolled. Such optimizations include the choice of instructions with the lowest latencies, determining loop unroll factors depending on the loop size, the instruction cache size, and so on.
Since your generated code is different from the one you got in class, I suppose the underlying architectures must be different. In this case, even with the same compiler flags you won't be able to get the same assembly code (even with no optimizations you'll get different assembly codes).
For that, you should concentrate on comparing the optimized and non-optimized codes rather than trying to stick to what you were given in class. I even think that it's a great reverse engineering exercise to compare your optimized code to the one you were given.
You can find one of my earlier posts about compiler optimizations in here.
Two great books on the subject are The Dragon Book (Compilers: Principles, Techniques, and Tools) by Aho, Seti, and Ulman, and also Engineering a Compiler by Keith Cooper, and Linda Torczon.

Related

Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support to do vectorization over here, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers for this 2011 question on Quora, which suggests that compilers will do vectorization where possible.
Maybe, compilers will only vectorize very obvious cases such as a std::array<int, 4>, and no more than that, thus C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: Do compilers for other languages do better automatic vectorization (maybe due to language design) (so that the C++ standards committee decides it necessary for explicit (C++17-style) vectorization)?

The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also very rightly point out using their compiler also allows you to use pragma simd constructs to further show the compiler assumptions that can or can't be made (that are unclear from a purely syntactic level), and hence allow the compiler to further vectorise the code without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (eg multiply all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
But for more complicated code, the vectorisation gains that can be had come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities do so, perhaps with pragma simd constructs or OpenMP, then you may get the results you want.
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this - put your c++ code in there and look at what different compilers actually generate? Very handy... it doesn't include older version of MSVC (I think it currently supports VC++ 2017 and later versions) but will show you what different versions of ICC, GCC, Clang and others can do with code...

How can a compiler be cross platform(hardware)?

I just realized that binary compilers convert source code to the binary of the destination platform. Kind of obvious... but if a compiler works such way, then how can the same compiler be used for different systems like x86, ARM, MIPS, etc?
Shouldn't they be supposed to "know" the machine-language of the hardware platform to be able to know how to build the binary? Does a compiler(like gcc) knows the machine language of every single platform that is supported?
How is that system possible, and how can a compiler be optimized for that many platforms at the same time?

Yes, they have to "know" the machine language for every single platform they support. This is a required to generate machine code. However, compilation is a multi-step process. Usually, the first steps of the compilation are common to most architectures.
Taken from wikipedia
Structure of a compiler
Compilers bridge source programs in high-level
languages with the underlying hardware.
A compiler requires
determining the correctness of the syntax of programs,
generating correct and efficient object code,
run-time organization, and
formatting output according to assembler and/or linker conventions.
A
compiler consists of three main parts: the frontend, the middle-end,
and the backend.
The front end
checks whether the program is correctly
written in terms of the programming language syntax and semantics.
Here legal and illegal programs are recognized. Errors are reported,
if any, in a useful way. Type checking is also performed by collecting
type information. The frontend then generates an intermediate
representation or IR of the source code for processing by the
middle-end.
The middle end
is where optimization takes place. Typical
transformations for optimization are removal of useless or unreachable
code, discovery and propagation of constant values, relocation of
computation to a less frequently executed place (e.g., out of a loop),
or specialization of computation based on the context. The middle-end
generates another IR for the following backend. Most optimization
efforts are focused on this part.
The back end
is responsible for translating the IR from the middle-end into assembly code. The target
instruction(s) are chosen for each IR instruction. Register allocation
assigns processor registers for the program variables where possible.
The backend utilizes the hardware by figuring out how to keep parallel
execution units busy, filling delay slots, and so on. Although most
algorithms for optimization are in NP, heuristic techniques are
well-developed.
More this article which describes the structure of a compiler and on this one which deals with Cross compilers.

The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross HW compilers emit "intermediate representation" of the code , which is HW agnostic and then its being customized via the native tool chain

Yes it is possible, it's called Cross Compiler. Compilers usually first they generate the object code which is not understanable by the current machine but it can be migrated to the destiny machine with another compiler. Next, object code is "compiled" again and linked with external libraries of the target machines.
TL;DR: Yes, the compilers knows the target code, but you can compile in another hardware.
I recommend you to read attached links for information.

Every platform has its own toolchain, toolchain includes gcc,gdb,ld,nm etc.
Let's take specific example of gcc as of now. GCC source code has many layers including architecture dependent and independent part. Architecture dependent part contains procedures to handle architecture specific things like their stack, function calls, floating point operations. We need to cross compile the gcc source code for a specific architecture like for ARM. You can see its steps here for reference:- http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain.
This architecture dependent part is responsible for handling machine language operations.

Assembly language and compiled languages

How is assembly faster than compiled languages if both are translated to machine code?
I'm talking about truly compiled languages which are translated to machine code. Not C# or Java which are compiled to an intermediate language first and then compiled to native code by a software interpreter, etc.
On Wikipedia, I found something which I'm not sure if it's in any way related to this. Is it because that translation from a higher level language generates extra machine code? Or is my understanding wrong?
A utility program called an assembler is used to translate assembly language statements into the target computer's machine code. The assembler performs a more or less isomorphic translation (a one-to-one mapping) from mnemonic statements into machine instructions and data. This is in contrast with high-level languages, in which a single statement generally results in many machine instructions.

Well, it relates a bit to your question, indeed. The point is that compilers produce inefficient machine code at times for various reasons, such as not being able to completely analyze your code, inserting automatic range checks, automatic checks for objects being null, etc.
On the other hand if you write assembler code by hand and know what you're doing, then you can probably write some things much more efficient than the compiler, although the compiler's behavior may be tweaked and you can usually tell it not to do range checking, for example.
Most people, however, will not write better assembler code than a compiler, simply because compilers are written by people who know a good deal of really weird but really cool optimizations. Also things like loop unrolling are usually a pain to write yourself and make the resulting code faster in many cases.
While it's generally true that everything that a computer executes is machine code, the code that runs differs greatly depending on how many abstraction levels you put between the machine and the programmer. For Assembler that's one level, for Java there are a few more ...
Also many people mistakenly believe that certain optimizations at a higher abstraction layer pay off at a lower one. This is not necessarily the case and the compiler may just have trouble understanding what you are trying to do and fail to properly optimize it.

Assembly may sometimes be faster than a compiled language if an assembly programmer writes better assembly than that generated by the compiler.
A compiled language is often faster than assembly because programmers who write compilers usually know the CPU architecture better than programmers who are utilizing assembly in a one-off, limited-case, situation.

An assembly expert may be able to write assembly code that is more effective (fewer instructions, more efficient instructions, SIMD, ...) than what a compiler generates automatically.
However, most of the time, you're better off trusting the optimizer of your compiler.
Learn what your compiler does. Then let the compiler do it.

My standard answer when questions about assembly vs. high-level come up is to take a look at Michael Abrash's Graphics Programming Black Book.
The first couple of chapters give a good idea of what you can optimise effectively using assembly, and what you can't.
You can download it from GameDev - Jeff's links seem to be broken now unfortunately.

All good answers. My only additional point is that programmers tend to write a certain number of lines of code per day, regardless of language. Since the advantage of a high-level language is that it lets you get more done with less code, it takes incredible programmer discipline to actually write less code.
This is especially an issue for performance because it matters almost nowhere except in a tiny part of the code. It only matters in your hotspots - code that you write (1) consuming a significant fraction of execution time (2) without calling functions (3).

First of all, compilers generate very good (fast) assembly code.
It's true that compilers can add extra code since high order languages have mechanisms, like virtual methods and exceptions in C++. Thus the compiler will have to produce more code. There are cases where raw assembly could speed up the code but that's rare nowdays.

First - assembler should be used only in small code pieces, which eat most of the CPU time in a program - some kind of calculations for example - in the "bottle neck" of algorithm.
Secondly - it depends on experience in ASM of those who implements the same code in Assembler. If the assembler implementation of "bottle neck" code will be faster. If experience is low - it will be slower. And it will contain a lot of bugs. If experience is high enough - ASM will give significant profit.

How is assembly faster than compiled languages if both are translated to machine code?
The implicit assumption is hand-written assembly code. Of course, most compilers (e.g. GCC for C, C++, Fortran, Go, D etc...) are generating some assembler code; for example you might compile your foo.cc C++ source code with g++ -fverbose-asm -Wall -S -O2 -march=native foo.cc and look into the generated foo.s assembler code.
However, efficient assembler code is so difficult to write that, today, compilers can optimize better than human do. See this.
So practically speaking, it is not worth coding in assembler (also, take into account that development efforts cost very often much more than the hardware running the compiled code). Even when performance matters a lot and is worth spending a lot of money, it is better to hand-code only very few routines in assembler, or even to embed some assembler code in some of your C routines.
Look into the CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”

GCC: Inline assembly - good for?

So I just found out GCC could do inline assembly and I was wondering two things:
What's the benefit of being able to inline assembly?
Is it possible to use GCC as an assembly compiler/assembler to learn assembly?
I've found a couple articles but they are all oldish, 2000 and 2001, not really sure of their relevance.
Thanks

The benefit of inline assembly is to have the assembly code, inlined (wait wait, don't kill me). By doing this, you don't have to worry about calling conventions, and you have much more control of the final object file (meaning you can decide where each variable goes- to which register or if it's memory stored), because that code won't be optimized (assuming you use the volatile keyword).
Regarding your second question, yes, it's possible. What you can do is write simple C programs, and then translate them to assembly, using
gcc -S source.c
With this, and the architecture manuals (MIPS, Intel, etc) as well as the GCC manual, you can go a long way.
There's some material online.
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/
The downside of inline assembly, is that usually your code will not be portable between different compilers.
Hope it helps.

Inline Assembly is useful for in-place optimizations, and access to CPU features not exposed by any libraries or the operating system.
For example, some applications need strict tracking of timing. On x86 systems, the RDTSC assembly command can be used to read the internal CPU timer.
Time Stamp Counter - Wikipedia
Using GCC or any C/C++ compiler with inline assembly is useful for small snippets of code, but many environments do not have good debugging support- which will be more important when developing projects where inline assembly provides specific functionality. Also, portability will become a recurring issue if you use inline assembly. It is preferable to create specific items in a suitable environment (GNU assembler, MASM) and import them projects as needed.

Inline assembly is generally used to access hardware features not otherwise exposed by the compiler (e.g. vector SIMD instructions where no intrinsics are provided), and/or for hand-optimizing performance critical sections of code where the compiler generates suboptimal code.
Certainly there is nothing to stop you using the inline assembler to test routines you have written in assembly language; however, if you intend to write large sections of code you are better off using a real assembler to avoid getting bogged down with irrelevancies. You will likely find the GNU assembler got installed along with the rest of the toolchain ;)

The benefit of embedding custom assembly code is that sometimes (dare I say, often times) a developer can write more efficient assembly code than a compiler can. So for extremely performance intensive items, custom written assembly might be beneficial. Games tend to come to mind....
As far as using it to learn assembly, I have no doubt that you could. But, I imagine that using an actual assembly SDK might be a better choice. Aside from the standard experimentation of learning how to use the language, you'd probably want the knowledge around setting up a development environment.

You should not learn assembly language by using the inline asm feature.
Regarding what it's good for, I agree with jldupont, mostly obfuscation. In theory, it allows you to easily integrate with the compiler, because the complex syntax of extended asm allows you to cooperate with the compiler on register usage, and it allows you to tell the compiler that you want this and that to be loaded from memory and placed in registers for you, and finally, it allows the compiler to be warned that you have clobbered this register or that one.
However, all of that could have been done by simply writing standard-conforming C code and then writing an assembler module, and calling the extension as a normal function. Perhaps ages ago the procedure call machine op was too slow to tolerate, but you won't notice today.
I believe the real answer is that it is easier, once you know the contraint DSL. People just throw in an asm and obfuscate the C program rather than go to the trouble of modifying the Makefile and adding a new module to the build and deploy workflow.

This isn't really an answer, but kind of an extended comment on other peoples' answers.
Inline assembly is still used to access CPU features. For instance, in the ARM chips used in cell phones, different manufacturers distinguish their offerings via special features that require unusual machine language instructions that would have no equivalent in C/C++.
Back in the 80s and early 90s, I used inline assembly a lot for optimizing loops. For instance, C compilers targeting 680x0 processors back then would do really stupid things, like:
calculate a value and put it in data register D1
PUSH D1, A7 # Put the value from D1 onto the stack in RAM
POP D1, A7 # Pop it back off again
do something else with the value in D1
But I haven't needed to do that in, oh, probably fifteen years, because modern compilers are much smarter. In fact, current compilers will sometimes generate more efficient code than most humans would. Especially given CPUs with long pipelines, branch prediction, and so on, the fastest-executing sequence of instructions is not always the one that would make most sense to a human. So you can say, "Do A B C D in that order", and the compiler will scramble the order all around for greater efficiency.
Playing a little with inline assembly is fine for starters, but if you're serious, I echo those who suggest you move to a "real" assembler after a while.

Manual optimization of loops that are executed a lot. This article is old, but can give you an idea about the kinds of optimizations hand-coded assembly is used for.
You can also use the assembler gcc uses directly. It's called as (see man as). However, many books and articles on assembly assume you are using a DOS or Windows environment. So it might be kind of hard to learn on Linux (maybe running FreeDOS on a virtual machine), because you not only need to know the processor (you can usually download the official manuals) you code for but also how hook to into the OS you are running.
A nice beginner book using DOS is the one by Norton and Socha. It's pretty old (the 3rd and latest edition is from 1992), so you can get used copies for like $0.01 (no joke). The only book I know of that is specific to Linux is the free "Programming from the Ground Up"

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?

Generally, few compilers use them. GCC and Visual Studio arn't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, it will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but didn't work last I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.

Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler

I have seen gcc use sse to zero out a default std::string object. Not a particularly powerful use of sse, but it exists. In most cases, though you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!

If you use the vector pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits. ( for 64 bit reals it is actually slower to do SIMD).
Latest versions of the compiler will also automatically parallelise accross cores

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio