Coding Style for GCC ARM Optimization levels - performance

I've been doing embedded firmware since 1977 but I have never enabled optimization on any of the compilers I've used.
I'm working with the GCC ARM compiler for a CM4 micro.
Code runs as expected with NO optimization.
I use a lot of structures and pointers in my code.
I use volatile when a variable can change from within an interrupt routine.
I recently needed to speed up execution of my code, so I used optimization level -Og (the first time I've ever enabled optimization), which still gives good debugging and improved performance where I wanted it.
My issue/concern is that the code is now really flaky!
It behaves OK, then I make a small change and it misbehaves. The behavior changes each time I run the compiler, almost as if there were an address-alignment problem or instructions had been removed entirely.
I can change some variables to volatile and that also changes the behavior, but I don't understand why making global variables volatile (ones not modified in an interrupt routine) would improve the behavior.
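A minimal sketch of the volatile/ISR pattern in question, assuming a hypothetical timer interrupt that signals the main loop (the names are illustrative, not from the actual project):

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared with the ISR: volatile so the optimizer cannot cache the value
 * in a register or hoist the read out of the polling loop below. */
static volatile bool     tick_flag  = false;
static volatile uint32_t tick_count = 0;

/* Hypothetical timer interrupt handler (the vector name depends on the part). */
void TIMER_IRQHandler(void)
{
    tick_count++;
    tick_flag = true;
}

int main(void)
{
    for (;;) {
        /* Without volatile, the compiler may legally read tick_flag once
         * and turn this into while (1); once -Og/-O1 is enabled. */
        while (!tick_flag) {
            /* wait for the interrupt */
        }
        tick_flag = false;

        /* ... do the time-critical work using tick_count ... */
    }
}
```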
I'm about ready to give up on overall optimization and look at function-specific optimization instead, since I know which functions affect the performance I'm trying to improve.
Can anyone explain how coding style can interact badly with optimization?
Any good documents that address coding style with optimization in mind?
Does GCC function level optimization work well?
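For context, GCC does let you override the optimization level per function with an attribute or pragma; a minimal sketch (the function names and bodies are made up for illustration):

```c
#include <stdint.h>

/* Raise the optimization level for just this hot function, leaving the
 * rest of the translation unit at the global setting (e.g. -O0 or -Og). */
__attribute__((optimize("O2")))
void scale_samples(int16_t *buf, int n)
{
    for (int i = 0; i < n; i++) {
        buf[i] = (int16_t)((buf[i] * 3) >> 2);
    }
}

/* The same idea with pragmas, which can cover a group of functions: */
#pragma GCC push_options
#pragma GCC optimize ("O3")
void sum_samples(const int16_t *buf, int n, int32_t *result)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += buf[i];
    }
    *result = acc;
}
#pragma GCC pop_options
```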
Thanks.
Joe

Related

Do compilers revert to the "status quo" when optimizations produce worse results?

To my knowledge, when using optimizations there is a risk of hitting the "maybe it will be worse" case (i.e. the performance will be degraded, or the code size will be larger, or both). However, are compilers able to detect such cases and return to the "status quo" (i.e. fall back to the original non-optimized code) when an optimization produces worse results? Can someone give (if possible) particular examples of what compilers (for example, gcc, Clang (LLVM), etc.) do in this case?
In JIT compilers there is a thing called deoptimization. Normally the compiler will optimize heavily based on some assumption, but during execution that assumption may fail. For example, the compiler may assume the input of a function is always an integer and produce highly efficient code for integer manipulation, but if (and such things happen in dynamic languages) the input is suddenly an array or a string, the code has to revert. See V8's TurboFan speculative optimization for an example.
For non-JIT compilers there is no way to deoptimize at runtime, though the compiler may create multiple execution paths. Your question is not fully logical, because how would the compiler know it had created suboptimal code? It can only use the same algorithm it used to do the optimization in the first place. That's probably why you are being downvoted.

Why do some compilers, like gcc, require you to activate optimizations? Why not run them by default?

Title basically, why not run all possible optimizations by default?
Enabling optimizations is less convenient when you're in a write-compile-test cycle: it makes compilation take longer, and it makes the resulting code work less well with a debugger (because the generated code may be structured very differently from what you wrote). So when trying to rapidly implement and test changes, many people prefer to have optimizations off, and then to turn them back on for final testing and for the build that will be shipped.
(There are downsides to this, of course. There are some warnings that the compiler can only issue when optimizing, such as uninitialized variables. And there are many cases where buggy code only actually misbehaves when optimizations are on. So there is some benefit to using optimizations more often. Still, if you're trying to track down a bug that is reproducible with optimizations off, it is definitely easier to debug it that way.)
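As a small illustration of that last point (hypothetical code, not from any particular project): GCC's -Wmaybe-uninitialized check relies on the data-flow analysis done during optimization, so code like this may compile silently with -O0 -Wall yet produce a warning at -O1 or higher.

```c
#include <stdio.h>

int lookup(int key)
{
    int value;              /* only assigned on one path */

    if (key > 0) {
        value = key * 2;
    }
    /* If key <= 0, value is read uninitialized here.  gcc -O0 -Wall
     * often stays silent, while -O1/-O2 -Wall can report this as
     * -Wmaybe-uninitialized thanks to the extra data-flow analysis. */
    return value;
}

int main(void)
{
    printf("%d\n", lookup(-1));
    return 0;
}
```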
This means that if you were to count up the total number of times that the compiler is run, the vast majority will be runs where the user prefers not to optimize. Therefore it makes a certain amount of sense for "no optimization" to be the default.
Of course, now people are used to this behavior, and so even if it were better to have the opposite default, changing it would cause confusion.

What is the -O3 optimization flag used in gcc?

Can someone answer a simple question: why do we pass -O3 along with -o to gcc when compiling a C program? Does it increase the speed of compilation or reduce the time spent compiling?
Thanks!!!
It can potentially increase the performance of the generated code.
Compilation itself usually takes longer, because optimization requires (much) more analysis by the compiler.
For typical modern C++ code, the effect of -O2 and higher can be very dramatic (an order of magnitude, depending on the nature of the program).
Precisely which optimizations are performed at the various optimization levels is documented in the manual pages: http://linux.die.net/man/1/gcc
Keep in mind, though, that the various optimizations can potentially make latent bugs manifest, because compilers are allowed to exploit Undefined Behaviour¹ to achieve more efficient target code.
Undefined Behaviour lurks in places where the language standard(s) do not specify exactly what needs to happen. These can be extremely subtle.
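One toy example of how subtle this can be (hypothetical code, purely for illustration): signed integer overflow is undefined, so an optimizer is allowed to assume it never happens and may fold away a check that "obviously" works at -O0.

```c
#include <limits.h>
#include <stdio.h>

/* Intended as an overflow check, but x + 1 overflows when x == INT_MAX,
 * which is undefined behaviour for signed int.  An optimizing compiler is
 * allowed to assume the overflow never happens and fold the comparison to
 * "always false". */
int will_wrap(int x)
{
    return x + 1 < x;
}

int main(void)
{
    /* Commonly prints 1 at -O0 and 0 at -O2: the bug was always there,
     * the optimizer just made it visible. */
    printf("%d\n", will_wrap(INT_MAX));
    return 0;
}
```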
So I recommend against using anything higher than -O2 unless you have rigid quality controls in place that guard against such hidden undefined behaviours (think of valgrind/purify, static analysis tools, and (stress) testing in general).
¹ A very profound blog post about undefined behaviour in optimizing compilers is here: http://blog.regehr.org/archives/213 . In particular, it lets you take the perspective of a compiler writer, whose only objective is to generate the fastest possible code that still satisfies the specifications.

likely and unlikely macros

Are there any standard ways (using profilers) to check whether using these gcc-recognized branch-prediction macros saves clock cycles in terms of instruction pipelining? How can we measure a program with and without these macros? Is measuring the elapsed time the only way to do it?
Are there similar branch-prediction macros on Windows (the __assume keyword, for example)?
-Kartlee
I’m not familiar with any profilers that will show branch efficiencies. The Linux time program should work well enough to help you benchmark.
On all modern x86 CPUs, conditional jump (Jcc) instructions are faster if they don't branch and instead just fall through to the next instruction.
GCC’s __builtin_expect function provides a hint to the compiler—it tells which side of an if() should be the fall-through and which side should be the branch. You should only use this function if you are 100% sure about it. There is no equivalent function for VC++. I’m not sure about ICC.
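For reference, the usual way these hints are wrapped and used looks roughly like this (the error-handling function here is made up for illustration):

```c
#include <stddef.h>
#include <stdio.h>

/* Classic wrappers around GCC's __builtin_expect; the second argument is
 * the value the expression is expected to evaluate to. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(const char *buf, size_t len)
{
    /* Hint that the error path is rare, so the compiler lays out the
     * common case as the straight-line fall-through code. */
    if (unlikely(buf == NULL || len == 0)) {
        fprintf(stderr, "bad input\n");
        return -1;
    }

    /* ... hot path ... */
    return 0;
}
```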
A better way to do this is to avoid these non-standard functions and use Profile Guided Optimization (PGO), in which you run the program and it records all these branches to figure out where stuff goes.

Questions about possible Java (or other memory-managed language) optimizations

From what I have read, the Java compiler (usually) seems to compile Java to not very optimised (if at all?) bytecode, leaving it to the JIT to optimise. Is this true? And if it is, has there been any exploration (possibly in alternative implementations) of getting the compiler to optimise the code so the JIT has less work to do (is this possible)?
Also, many people seem to dislike native code generation (sometimes referred to as ahead-of-time compilation) for Java (and many other high-level memory-managed languages) for many reasons, such as loss of portability (etc.), but also partially because (at least for languages that have a just-in-time compiler) the thinking goes that ahead-of-time compilation to machine code will miss the optimisations a JIT compiler can do, and may therefore be slower in the long run.
This leads me to wonder whether anyone has ever tried to implement profile-guided optimization (http://en.wikipedia.org/wiki/Profile-guided_optimization) for Java or other memory-managed languages (compiling to a binary plus some extras, then running the program and analysing the runtime information of the test run to generate a hopefully more optimised binary for real-world usage), and how this would compare to JIT code? Anyone have a clue?
Personally, I think the big difference is not between JIT compiling and AOT compiling, but between class-compilation and whole-program optimization.
When you run javac, it only looks at a single .java file, compiling it into a single .class file. All the interface implementations and virtual methods and overrides are checked for validity but left unresolved (because it's impossible to know the true method invocation targets without analyzing the whole program).
The JVM uses "runtime loading and linking" to assemble all of your classes into a coherent program (and any class in your program can invoke specialized behavior to change the default loading/linking behavior).
But then, at runtime, the JVM can remove the vast majority of virtual methods. It can inline all of your getters and setters, turning them into raw fields. And when those raw fields are inlined, it can perform constant-propagation to further optimize the code. (At runtime, there's no such thing as a private field.) And if there's only one thread running, the JVM can eliminate all synchronization primitives.
To make a long story short, there are a lot of optimizations that aren't possible without analyzing the whole program, and the best time for doing whole program analysis is at runtime.
Profile-guided optimization has some caveats, one of them mentioned even in the Wikipedia article you linked. Its results are valid only:
- for the given samples, which represent how your code is actually used by the user or other code;
- for the given platform (CPU, memory and other hardware, OS, whatever); from a performance point of view there are quite big differences even among platforms that are usually considered (more or less) the same (e.g. compare an old single-core Athlon with 512 MB to a 6-core Intel with 8 GB, both running Linux but with very different kernel versions);
- for the given JVM and its config.
If any of these change, then your profiling results (and the optimizations based on them) are not necessarily valid any more. Most likely some of the optimizations will still have a beneficial effect, but some of them may turn out to be suboptimal (or even degrade performance).
As was mentioned, JIT JVMs do something very similar to profiling, but they do it on the fly. This is also called 'hotspot' compilation, because the VM constantly monitors the executed code, looks for hot spots that are executed frequently, and tries to optimize only those parts. At that point it can exploit more knowledge about the code (its context, how it is used by other classes, etc.), so, as mentioned by you and the other answers, it can do better optimizations than a static compiler. It keeps monitoring, and if needed it will do another round of optimization later, this time trying even harder (looking for more, and more expensive, optimizations).
Because it works on real-life data (usage statistics + platform + config), it can avoid the caveats mentioned before.
The price is the additional time it has to spend on 'profiling' + JIT-ing. Most of the time that time is spent quite well.
I guess a profile-guided optimizer could still compete with it (or even beat it), but only in some special cases, if you can avoid the caveats:
- you are quite sure that your samples represent the real-life scenario well and they won't change too much during execution;
- you know your target platform quite precisely and can do the profiling on it;
- and of course you know/control the JVM and its config.
That will rarely be the case, and I guess in general the JIT will give you better results, but I have no evidence for it.
Another possibility for getting value from profile-guided optimization is if you target a JVM that can't do JIT optimization (I think most small devices have such a JVM).
BTW one disadvantage mentioned in other answers would be quite easy to avoid: if static/profile guided optimization is slow (which is probably the case) then do it only for releases (or RCs going to testers) or during nightly builds (where time does not matter so much).
I think the much bigger problem would be to have good sample test cases. Creating and maintaining them is usually not easy and takes a lot of time. Especially if you want to be able to execute them automatically, which would be quite essential in this case.
The official Java HotSpot compiler does "adaptive optimisation" at runtime, which is essentially the same as the profile-guided optimisation you mentioned. This has been a feature of at least this particular Java implementation for a long time.
The trade-off of performing more static analysis or optimisation passes up-front at compile time is essentially the (ever-diminishing) returns you get from that extra effort versus the time it takes for the compiler to run. A compiler like MLton (for Standard ML) is a whole-program optimising compiler with a lot of static checks. It produces very good code, but becomes very, very slow on medium-to-large programs, even on a fast system.
So the Java approach seems to be to use JIT and adaptive optimisation as much as possible, with the initial compilation pass just producing an acceptable valid binary. At the absolute opposite end is an approach like that of MLKit, which does a lot of static inference of regions and memory behaviour.
