Compiling in GCC: Is -O3 harmful? - gcc

I have heard that one should not compile with -O3 option with gcc. Is that true? If so, what are the reasons for avoiding -O3?

The answer is: it depends on your code.
The basic rule of thumb is like this:
At -O1 the compiler does optimizations that don't take too long to compute.
At -O2 the compiler does "expensive" optimizations that may slow the compile process. They might also make the output program a little larger, but probably not so much.
-Os is roughly the same as -O2, but the optimizations are tuned more towards size than speed. For the most part these two goals don't conflict (more optimal code takes fewer steps and is therefore smaller), but there are some tricks that duplicate code to avoid branching penalties, for example.
At -O3 the compiler really cranks up the space-hungry optimizations. It will inline functions much more aggressively, and try to use vectorization where possible.
You can read more details in the GCC documentation. If you really want to super optimize your code then you can try to enable even more options not used even at -O3; the -floop-* options, for instance.
The problem with speed-space optimizations, in particular, is that they can have a negative impact on the effectiveness of your memory caches. The code might be better for the CPU, but if it's not better for your memory, then you lose. For this reason, if your program doesn't have a single hot-spot where it spends all its time, then you might find it is slowed down overall.
Real-world optimization is an imprecise science for three reasons:
Users' hardware varies a lot.
What's good for one code base might not be good for another.
We want the compiler to run quickly, so it must make best guesses, rather than trying all the options and picking the best.
Basically, the answer is always the same: if performance matters, try all the optimization levels, measure how well your code performs, and choose the best one for you. And do this again every time something big changes.
If performance does not matter, -O2 is the choice for you.
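To make the "measure it" advice concrete, here is a minimal timing sketch in C. The workload `sum_of_squares` and the helper `time_workload` are hypothetical placeholders invented for illustration; in practice you would time your own hot code, ideally several runs on representative input.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for your real hot code: sum of i*i for i in [0, n). */
uint64_t sum_of_squares(uint32_t n)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < n; i++)
        total += (uint64_t)i * i;
    return total;
}

/* Runs the workload once and returns the CPU time it used, in seconds.
 * The result is written through `out` so that it stays observable and
 * the call is less likely to be optimized away entirely. */
double time_workload(uint32_t n, uint64_t *out)
{
    assert(out != NULL);
    clock_t start = clock();
    *out = sum_of_squares(n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

One way to use it: call `time_workload` from a small `main`, build the same file once with `gcc -O2` and once with `gcc -O3`, and compare the reported times over several runs. A toy loop like this says little about a real program, so treat the numbers only as a template for your own benchmark.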

Related

Performance overhead with "-g" (debug) flag of GCC?

I'm compiling a program with -O3 for performance and -g for debug symbols (so that in case of a crash I can use the core dump). One thing bothers me a lot: does the -g option result in a performance penalty? When I look at the output of the compilation with and without -g, I see that the output without -g is 80% smaller than the output of the compilation with -g. If the extra space goes to the debug symbols, I don't care about it (I guess), since this part is not used at runtime. But if for each instruction in the output compiled without -g I need to execute 4 more instructions in the output compiled with -g, then I would certainly prefer to stop using -g, even at the cost of not being able to process core dumps.
How can I find out the size of the debug symbols section inside the program and, in general, does compiling with -g create a program that runs slower than the same code compiled without -g?
Citing from the gcc documentation
GCC allows you to use -g with -O. The shortcuts taken by optimized
code may occasionally produce surprising results: some variables you
declared may not exist at all; flow of control may briefly move where
you did not expect it; some statements may not be executed because
they compute constant results or their values are already at hand;
some statements may execute in different places because they have been
moved out of loops.
that means:
I will insert debugging symbols for you, but I won't try to retain them if an optimization pass invalidates them; you'll have to deal with that.
Debugging symbols aren't written into the code but into another section called "debug section" which isn't even loaded at runtime (only by a debugger). That means: no code changes. You shouldn't notice any performance difference in code execution speed but you might experience some slowness if the loader needs to deal with the larger binary or if it takes into account the increased binary size somehow. You will probably have to benchmark the app yourself to be 100% sure in your specific case.
Notice that there's also another option from gcc 4.8:
-Og
Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.
This flag will impact performance because it disables any optimization pass that would interfere with debugging information.
Finally, some optimizations may be better suited to one architecture than another, and unless instructed to target your specific processor (see the -march/-mtune options for your architecture), at -O3 gcc will do its best for a generic architecture. That means you might even find -O3 slower than -O2 in some contrived scenarios. "Best-effort" doesn't always mean "the best available".

GCC optimization levels. Which is better?

I am focusing on the CPU/memory consumption of programs compiled by GCC.
Is code compiled with -O3 always so greedy in terms of resources?
Is there any scientific reference or specification that shows the difference in memory/CPU consumption between the optimization levels?
People working on this problem often focus on the impact of these optimizations on the execution time, compiled code size, energy. However, I can't find too much work talking about resource consumption (by enabling optimizations).
Thanks in advance.
No, there is no absolute way, because optimization in compilers is an art (and is not even well defined, and might be undecidable or intractable).
But some guidelines first:
be sure that your program is correct and has no bugs before optimizing anything, so do debug and test your program
have well designed test cases and representative benchmarks (see this).
be sure that your program has no undefined behavior (and this is tricky, see this), since GCC will optimize strangely (but very often correctly, according to C99 or C11 standards) if you have UB in your code; use the -fsanitize= options (e.g. -fsanitize=undefined or -fsanitize=address), along with gdb and valgrind, during the debugging phase.
profile your code (on various benchmarks), in particular to find out what parts are worth optimization efforts; often (but not always) most of the CPU time happens in a small fraction of the code (rule of thumb: 80% of time spent in 20% of code; on some applications like the gcc compiler this is not true, check with gcc -ftime-report to ask gcc to show time spent in various compiler modules).... Most of the time "premature optimization is the root of all evil" (but there are exceptions to this aphorism).
improve your source code (e.g. use carefully and correctly restrict and const, add some pragmas or function or variable attributes, perhaps use wisely some GCC builtins __builtin_expect, __builtin_prefetch -see this-, __builtin_unreachable...)
use a recent compiler. The current version of GCC was 5.2 in October 2015 (and GCC 8 in June 2018), and continuous progress on optimization is being made; you might consider compiling GCC from its source code to have a recent version.
enable all warnings (gcc -Wall -Wextra) in the compiler, and try hard to avoid all of them; some warnings may appear only when you ask for optimization (e.g. with -O2)
Usually, compile with -O2 -march=native (or perhaps -mtune=native; I assume that you are not cross-compiling, and if you are, add the appropriate -march option) and benchmark your program with that
Consider link-time optimization by compiling and linking with -flto and the same optimization flags. E.g., put CC= gcc -flto -O2 -march=native in your Makefile (then remove -O2 -mtune=native from your CFLAGS there)...
Try also -O3 -march=native; usually you might get a tiny improvement over -O2 (but not always: you might sometimes have slightly faster code with -O2 than with -O3, though this is uncommon)
If you want to optimize the generated program size, use -Os instead of -O2 or -O3; more generally, don't forget to read the section Options That Control Optimization of the documentation. I guess that both -O2 and -Os would optimize the stack usage (which is very related to memory consumption). And some GCC optimizations are able to avoid malloc (which is related to heap memory consumption).
you might consider profile-guided optimizations, -fprofile-generate, -fprofile-use, -fauto-profile options
dive into the documentation of GCC, it has numerous optimization & code generation arguments (e.g. -ffast-math, -Ofast ...) and parameters and you could spend months trying some more of them; beware that some of them are not strictly C standard conforming!
recent GCC and Clang can emit DWARF debug information (somehow "approximate" if strong optimizations have been applied) even when optimizing, so passing both -O2 and -g could be worthwhile (you still would be able, with some pain, to use the gdb debugger on optimized executable)
if you have a lot of time to spend (weeks or months), you might customize GCC using MELT (or some other plugin) to add your own new (application-specific) optimization passes; but this is difficult (you'll need to understand GCC internal representations and organization) and probably rarely worthwhile, except in very specific cases (those when you can justify spending months of your time for improving optimization)
you might want to understand the stack usage of your program, so use -fstack-usage
you might want to understand the emitted assembler code, use -S -fverbose-asm in addition to optimization flags (and look into the produced .s assembler file)
you might want to understand the internal working of GCC, use various -fdump-* flags (you'll get hundreds of dump files!).
Of course the above todo list should be used in an iterative and agile fashion.
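As an illustration of the "improve your source code" item in the list above, here is a minimal sketch of restrict, const and __builtin_expect in C. The function names and the UNLIKELY macro are hypothetical, chosen only for this example, and whether such hints help at all has to be checked by benchmarking.

```c
#include <assert.h>
#include <stddef.h>

/* __builtin_expect is a GCC builtin (also accepted by Clang); fall back to
 * a plain condition on other compilers. The macro name is illustrative. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* dst[i] += a * src[i]. `restrict` promises the compiler that dst and src
 * do not overlap, which can make vectorization easier. Breaking that
 * promise is undefined behavior, so only use it when it is really true. */
void scaled_add(float *restrict dst, const float *restrict src,
                float a, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += a * src[i];
}

/* Counts negative entries, hinting that negatives are expected to be rare.
 * The hint can only influence code layout, never the result. */
size_t count_negative(const int *values, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(values[i] < 0))
            count++;
    }
    return count;
}
```

A reasonable workflow is to write the plain version first, profile, and add such annotations only on the hot paths where a measurement shows a gain.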
For memory leak bugs, consider valgrind and several -fsanitize= debugging options. Read also about garbage collection (and the GC handbook), notably Boehm's conservative garbage collector, and about compile-time garbage collection techniques.
Read about the MILEPOST project in GCC.
Consider also OpenMP, OpenCL, MPI, multi-threading, etc... Notice that parallelization is a difficult art.
Notice that even GCC developers are often unable to predict the effect (on CPU time of the produced binary) of such and such optimization. Somehow optimization is a black art.
Perhaps gcc-help#gcc.gnu.org might be a good place to ask more specific & precise and focused questions about optimizations in GCC
You could also contact me on basileatstarynkevitchdotnet with a more focused question... (and mention the URL of your original question)
For scientific papers on optimizations, you'll find lots of them. Start with ACM TOPLAS, ACM TACO etc... Search for iterative compiler optimization etc.... And define better what resources you want to optimize for (memory consumption means next to nothing....).

what is -o3 optimization flag used in gcc

Can someone answer a simple question: why do we pass -O3 along with -o to gcc when compiling a C program? Does it speed up compilation, or reduce the time spent compiling?
Thanks!!!
It can potentially increase the performance of the generated code.
In principle, compilation usually takes longer because this requires (much) more analysis by the compiler.
For typical modern C++ code, the effect of -O2 and higher can be very dramatic (an order of magnitude, depending on the nature of the program).
Precisely which optimizations are performed at the various optimization levels is documented in the manual pages: http://linux.die.net/man/1/gcc
Keep in mind, though that the various optimizations can potentially make latent bugs manifest, because compilers are allowed to exploit Undefined Behaviour¹ to achieve more efficient target code.
Undefined Behaviour lurks in places where the language standard(s) do not specify exactly what needs to happen. These can be extremely subtle.
So I recommend against using anything higher than -O2 unless you have rigid quality controls in place that guard against such hidden undefined behaviours (think of valgrind/purify, static analysis tools and (stress) testing in general).
¹ A very profound blog about undefined behaviour in optimizing compilers is here: http://blog.regehr.org/archives/213 . In particular it lets you take the perspective of a compiler writer, whose only objective is to generate the fastest possible code that still satisfies the specifications.
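A small sketch of the kind of latent bug meant here, using signed integer overflow, which is undefined behaviour in C. The function names are hypothetical. The naive check is included only to show the pattern: because overflow "cannot happen" as far as the standard is concerned, an optimizing compiler may assume `a + 1 < a` is always false and remove the check.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* What NOT to do: this relies on wrap-around, but signed overflow is
 * undefined, so the optimizer is allowed to treat this as always false. */
bool increment_overflows_naive(int a)
{
    return a + 1 < a;
}

/* Well-defined alternative: test against the limits BEFORE doing the
 * arithmetic, so no overflowing operation is ever evaluated. */
bool add_overflows(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}

/* Adds two ints; the assert documents the precondition that the caller
 * has already ruled out overflow. */
int checked_add(int a, int b)
{
    assert(!add_overflows(a, b));
    return a + b;
}
```

Code like the naive check often appears to work in an unoptimized build and then changes behaviour at -O2 or -O3, which is why such bugs tend to be blamed on the optimizer.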

Is optimization dangerous?

I've read somewhere in the comments section of one of the questions here on Stack Overflow that:
Always start coding unoptimized.
If it meets the requirements then it's good,
else code an optimized version.
Check if the optimized code meets the requirements. If it does, keep it, but also keep the unoptimized version or paste the unoptimized version as a comment.
If the optimized version doesn't meet the requirements, delete it and stick with the unoptimized one.
Is there a term for this kind of programming? Is this a good or bad programming practice?
Is optimization dangerous? The only reason I can think of is that it can create unnecessary complexity which can lead to errors. Is there anything else?
Is there a general rule to be followed about when one should optimize or not?
Optimising code takes time from the developers that they could instead use to add new features or polish their product. Since the end goal of development is not the code but the product that is built with it, time spent on optimisation should be balanced against the other uses of that time.
It's a waste when the effort is spent on code that does not end up in the product due to a change in the requirements. If optimisation is performed from the beginning, you may also spend lots of time optimising a part of the code that only marginally contributes to the overall time spent by the application.
Instead, you should probably wait until you have a clear vision of what the application is and where the bottlenecks are before spending too much effort on optimisation. By then, you'll also have a large suite of unit tests and use cases that let you optimise with confidence that you don't break the application, and profiling will let you spend your effort only on the parts that are really worth optimising.
As always in engineering, optimisation is a tradeoff that you make. And you should definitely be sure that it is going to pay off before doing it if you mind your resources (time, money, ...).
In general, optimized code is more complex and difficult to get correct. It's also often counterproductive to optimize code early (simply because you may be spending time optimizing something that doesn't provide any real improvement in overall performance).
So the guidance you're asking about really boils down to:
write code that is easier to write and to verify as correct
optimize that code when it makes sense to expend the effort
No matter how fast it runs, incorrect code is not optimized code.
Always profile before optimizing. If a small amount of code takes up a majority of the execution time and you can prove this from your profiling results, consider the programming effort to write, test, reprofile, maintain, and have someone else inherit this added complexity. Once you've done this, revert your code back to before you optimized it for runtime and deoptimized it for readability. Just don't do it. Seriously, unless over 90% of your execution is spent on one function, it's not worth the effort.
Keep in mind that a speedup of 10x on code that consumes 90% of your runtime will decrease your total runtime by a factor of ~5. A speedup of infinity on that slow function still only speeds up your entire program by a factor of 10. If you're hoping for more than an order of magnitude speed improvement (which is my threshold for whether I may start to think about optimizing), you will need to change how you approach a problem, and this kind of change means rethinking the architecture of the program. If you're lucky, it may be as simple as replacing your queue with a priority queue. Most likely you won't be lucky. Sorry the answer is bleak.
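The arithmetic above is the relation commonly known as Amdahl's law: if a fraction p of the runtime is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch in C, with hypothetical function names, that reproduces the two figures quoted:

```c
#include <assert.h>

/* Overall speedup when `fraction` of the original runtime (0..1) is made
 * `local_speedup` times faster and the rest is left unchanged. */
double overall_speedup(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}

/* Upper bound on the overall speedup as local_speedup grows without
 * limit: only the untouched (1 - fraction) part of the runtime remains. */
double speedup_limit(double fraction)
{
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

With these definitions, `overall_speedup(0.9, 10.0)` is about 5.26 (the "factor of ~5") and `speedup_limit(0.9)` is about 10, matching the numbers in the answer.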
If optimization (by your compiler) is breaking your code while you believe it should not,
your code is not following the language standard, or
your compiler is broken, and you should upgrade it.
Language standards are quite complex to understand (in particular because not everything is specified, and some things are explicitly left unspecified or implementation specific). Read about undefined behavior.
Compilers are in practice tested a great deal, and you should first suspect your own code, and only after being sure your code is right (and fully standard conforming) suspect the compiler (in other words, compiler optimization bugs, where the generated code is wrong, are quite rare in practice).
Be sure to upgrade your compiler to a recent version. For GCC it is today (December 2013) 4.8.2; don't blame GCC if you are using a 4.4 or 3.6 GCC compiler; these ancient versions are not maintained anymore!
In practice, enable all warnings and debugging info when developing your code (e.g. compile with gcc -Wall -g at least, perhaps even with -Wextra). When you are sure of the quality of your code, compile it with optimizations and warnings (e.g. gcc -Wall -g -O2) and test it a lot.
In practice, profile the execution of your tests and (when possible) focus your efforts on the hot code (the one taking most of the CPU time).
Premature optimization is the root of all evil... but sometimes you don't really have a choice. Take audio codec implementations on ARM devices: there you need to benefit from the ARM DSP assembly extensions (like QADD, QSUB, QDADD, and QDSUB), which can only be expressed in C as multi-line sequences (highly inefficient). Compilers cannot do a good job there, so you will need to optimize the code with inline assembly.
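To show what "multiple lines of C" means here, this is a minimal portable sketch of the saturating 32-bit addition that QADD performs in a single instruction. The function names are hypothetical, and unlike the real instruction this version does not set the processor's saturation flag; it only illustrates the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating signed 32-bit add: results outside the int32_t range are
 * clamped to INT32_MAX or INT32_MIN instead of wrapping around. */
int32_t saturating_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b; /* cannot overflow in 64 bits */
    if (wide > INT32_MAX)
        return INT32_MAX;
    if (wide < INT32_MIN)
        return INT32_MIN;
    return (int32_t)wide;
}

/* Mixes two sample buffers with saturation, the kind of inner loop an
 * audio codec runs constantly. */
void mix_samples(int32_t *dst, const int32_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = saturating_add32(dst[i], src[i]);
}
```

Keeping the operation behind one small function also fits the advice to write with the optimization in mind: the body can later be swapped for an intrinsic or inline assembly without touching the callers.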
You will probably write "non-optimized" code first in that case, but with the optimization in mind, so that when you add the optimization you won't need to change your code too much.
Another case in which you know you will need to optimize your code is when you write signal processing functions (correlation, convolution, FFT) for embedded devices. In that case you will have to do algorithmic optimization (choose the best method to approach the problem, choose the right approximation) and code optimizations (to use the pipeline properly, for example), and it is good to know that you are going to optimize the code before starting (especially the algorithmic part, which can be done on paper even before coding and tested separately).

Assembly language and compiled languages

How is assembly faster than compiled languages if both are translated to machine code?
I'm talking about truly compiled languages which are translated to machine code. Not C# or Java which are compiled to an intermediate language first and then compiled to native code by a software interpreter, etc.
On Wikipedia, I found something which I'm not sure if it's in any way related to this. Is it because that translation from a higher level language generates extra machine code? Or is my understanding wrong?
A utility program called an assembler is used to translate assembly language statements into the target computer's machine code. The assembler performs a more or less isomorphic translation (a one-to-one mapping) from mnemonic statements into machine instructions and data. This is in contrast with high-level languages, in which a single statement generally results in many machine instructions.
Well, it relates a bit to your question, indeed. The point is that compilers produce inefficient machine code at times for various reasons, such as not being able to completely analyze your code, inserting automatic range checks, automatic checks for objects being null, etc.
On the other hand, if you write assembler code by hand and know what you're doing, then you can probably write some things much more efficiently than the compiler, although the compiler's behavior may be tweaked and you can usually tell it not to do range checking, for example.
Most people, however, will not write better assembler code than a compiler, simply because compilers are written by people who know a good deal of really weird but really cool optimizations. Also things like loop unrolling are usually a pain to write yourself and make the resulting code faster in many cases.
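For readers who have not seen loop unrolling, here is a minimal sketch in C of the same summation written plainly and unrolled by hand four times. The function names are hypothetical. Whether the hand-unrolled form is any faster depends on the compiler and target, since optimizers can apply this transformation themselves (GCC, for example, has a -funroll-loops option).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one addition and one loop test per element. */
int64_t sum_simple(const int32_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Manually unrolled by four: fewer loop tests per element, at the cost
 * of longer code and a separate loop for the leftover elements. */
int64_t sum_unrolled4(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int64_t total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++) /* remaining 0 to 3 elements */
        total += v[i];
    return total;
}
```

The extra remainder loop is exactly the kind of bookkeeping that makes unrolling tedious and error-prone by hand, which is the point the answer makes.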
While it's generally true that everything that a computer executes is machine code, the code that runs differs greatly depending on how many abstraction levels you put between the machine and the programmer. For Assembler that's one level, for Java there are a few more ...
Also many people mistakenly believe that certain optimizations at a higher abstraction layer pay off at a lower one. This is not necessarily the case and the compiler may just have trouble understanding what you are trying to do and fail to properly optimize it.
Assembly may sometimes be faster than a compiled language if an assembly programmer writes better assembly than that generated by the compiler.
A compiled language is often faster than assembly because programmers who write compilers usually know the CPU architecture better than programmers who are utilizing assembly in a one-off, limited-case, situation.
An assembly expert may be able to write assembly code that is more effective (fewer instructions, more efficient instructions, SIMD, ...) than what a compiler generates automatically.
However, most of the time, you're better off trusting the optimizer of your compiler.
Learn what your compiler does. Then let the compiler do it.
My standard answer when questions about assembly vs. high-level come up is to take a look at Michael Abrash's Graphics Programming Black Book.
The first couple of chapters give a good idea of what you can optimise effectively using assembly, and what you can't.
You can download it from GameDev - Jeff's links seem to be broken now unfortunately.
All good answers. My only additional point is that programmers tend to write a certain number of lines of code per day, regardless of language. Since the advantage of a high-level language is that it lets you get more done with less code, it takes incredible programmer discipline to actually write less code.
This is especially an issue for performance because it matters almost nowhere except in a tiny part of the code. It only matters in your hotspots - code that you write (1) consuming a significant fraction of execution time (2) without calling functions (3).
First of all, compilers generate very good (fast) assembly code.
It's true that compilers can add extra code, since higher-level languages have mechanisms such as virtual methods and exceptions in C++; the compiler has to produce more code to support them. There are cases where raw assembly could speed up the code, but that's rare nowadays.
First, assembler should be used only in small pieces of code that eat most of the CPU time in a program (some kind of calculation, for example), that is, in the bottleneck of the algorithm.
Secondly, it depends on the ASM experience of whoever implements the code in assembler; that determines whether the assembler implementation of the bottleneck code will be faster. If experience is low, it will be slower, and it will contain a lot of bugs. If experience is high enough, ASM will give a significant gain.
How is assembly faster than compiled languages if both are translated to machine code?
The implicit assumption is hand-written assembly code. Of course, most compilers (e.g. GCC for C, C++, Fortran, Go, D etc...) are generating some assembler code; for example you might compile your foo.cc C++ source code with g++ -fverbose-asm -Wall -S -O2 -march=native foo.cc and look into the generated foo.s assembler code.
However, efficient assembler code is so difficult to write that, today, compilers can optimize better than humans do. See this.
So practically speaking, it is not worth coding in assembler (also, take into account that development efforts cost very often much more than the hardware running the compiled code). Even when performance matters a lot and is worth spending a lot of money, it is better to hand-code only very few routines in assembler, or even to embed some assembler code in some of your C routines.
Look into the CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”

Resources