Should I always specify the -O3 flag when compiling a release version with gcc, or are there possible drawbacks?
Should I always specify the -O3 flag when compiling a release version with gcc?
No, or at least not always. For performance, sometimes -O3 produces code that is slower than what you get from -O2.
Under the hood, it's really a bunch of different optimizations that can be enabled/disabled individually, where -O3 (and -O2 and -Os) is just convenient shorthand for enabling a group of many optimizations. -O2 is supposed to represent "enable all optimizations that always help", and -O3 is supposed to represent "enable all optimizations that often help (but may make things worse)". Which actual optimizations are/aren't enabled at each -O level is detailed in the manual (at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html ).
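If you want to see exactly which optimization flags a given level enables for your particular GCC version, you can ask the compiler itself; a quick sketch (the diff shows what -O3 adds on top of -O2):

$ gcc -Q --help=optimizers -O2 > O2.txt
$ gcc -Q --help=optimizers -O3 > O3.txt
$ diff O2.txt O3.txt    # flags whose state differs between the two levels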
If you don't use the shortcuts and instead specify individual optimizations yourself, then (using a laborious "trial and error" approach, benchmarking the result in each case) you can find the set of optimizations that always helps your program (and avoid enabling optimizations that make your program's performance worse).
A more practical approach would be to start with -O2, then determine which of the optimizations that aren't already enabled by -O2 also help.
However, performance isn't the only thing that matters. To save time, most people just try -O2 or -O3 and pick whatever seems fastest. Part of the reason for this is that your software and the compiler are constantly changing, so any "laborious benchmarking" you do would need to be redone regularly.
Note: to actually get the maximum performance possible, each translation unit can be compiled with different optimization settings (so you can do the "laborious trial and error" for each individual source file); and then the resulting set of "optimized differently" object files can be fed into a link-time optimizer to optimize further.
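A sketch of what that looks like with GCC's link-time optimizer (the file names hot.c and cold.c are hypothetical; -flto is the real flag):

$ gcc -c -O3 -flto hot.c     # this file benchmarked faster at -O3
$ gcc -c -O2 -flto cold.c    # this one was faster at -O2
$ gcc -O2 -flto hot.o cold.o -o prog   # the link step runs the link-time optimizer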
I was quite surprised to see that even a simple program like:
print_string "Hello world !\n";
when statically compiled to native code through ocamlopt with some quite aggressive options (using musl), would still be around 190KB on my system.
$ ocamlopt.opt -compact -verbose -o helloworld \
-ccopt -static \
-ccopt -s \
-ccopt -ffunction-sections \
-ccopt -fdata-sections \
-ccopt -Wl \
-ccopt -gc-sections \
-ccopt -fno-stack-protector \
helloworld.ml && { ./helloworld ; du -h helloworld; }
+ as -o 'helloworld.o' '/tmp/camlasm759655.s'
+ as -o '/tmp/camlstartupfc4271.o' '/tmp/camlstartup5a7610.s'
+ musl-gcc -Os -o 'helloworld' '-L/home/vaab/.opam/4.02.3+musl+static/lib/ocaml' -static -s -ffunction-sections -fdata-sections -Wl -gc-sections -fno-stack-protector '/tmp/camlstartupfc4271.o' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/std_exit.o' 'helloworld.o' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/stdlib.a' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/libasmrun.a' -static -lm
Hello world !
196K helloworld
How to get the smallest binary from ocamlopt?
A size of 190KB is way too much for a simple program like that under today's constraints (IoT, Android, Alpine VMs...), and compares badly with a simple C program (around 6KB, or, coding directly in assembly and tweaking things to get a working binary, around 150 bytes). I naïvely thought that I could simply ditch C, write simple static programs that do trivial things, and get after compilation some simple assembly code that wouldn't be so far in size from the equivalent C program. Is that possible?
What I think I understand:
When removing gcc's -s to get some hints about what is left in the binary, I notice a lot of OCaml symbols, and I have also read that some of ocamlrun's environment variables are still interpreted in this form. It is as if what ocamlopt calls "native compilation" amounts to packing ocamlrun and the non-native bytecode of your program into one file and making it executable. Not exactly what I would have expected. I've obviously missed some important point. But if that is the case, I'd be interested to know why it isn't as I expected.
Other languages that compile to native code have the same issue, leaving naïve users (like myself) with roughly the same questions:
Go: Reason for huge size of compiled executable of Go
Rust: Why are Rust executables so huge?
I've also tested with Haskell, and without tweaks, all of these languages' compilers produce binaries above 700KB for the "hello world" program (it was the same for OCaml before the tweaks).
Your question is very broad and I'm not sure that it fits the format of Stack Overflow. It deserves a thorough discussion.
A size of 190KB is way too much for a simple program like that under today's constraints (IoT, Android, Alpine VMs...), and compares badly with a simple C program (around 6KB, or, coding directly in assembly and tweaking things to get a working binary, around 150 bytes)
First of all, it is not a fair comparison. Nowadays, a compiled C binary is an artifact that is far from being a standalone binary. It should be seen more like a plugin in a framework. Therefore, if you would like to count how many bytes a given binary actually uses, you should count the size of the loader, the shell, the libc library, and the whole Linux or Windows kernel, which together form the runtime of an application.
OCaml, unlike Java or Common Lisp, is very friendly to the common C runtime and tries to reuse most of its facilities. But OCaml still comes with its own runtime, of which the biggest (and most important) part is the garbage collector. The runtime is not extremely big (about 30 KLOC) but still contributes to the weight. And since OCaml uses static linking, every OCaml program will have a copy of it.
Therefore, C binaries have a significant advantage, as they are usually run on systems where the C runtime is already available (and is therefore usually excluded from the equation). There are, however, systems with no C runtime at all, where only the OCaml runtime is present; see Mirage, for example. On such systems, OCaml binaries are much more favorable. Another example is the OCaPic project, in which (after tweaking the compiler and runtime) they managed to fit the OCaml runtime and programs into 64KB of Flash (read the paper; it is very insightful about binary sizes).
How to get the smallest binary from ocamlopt?
When it is really necessary to minimize the size, use Mirage unikernels or implement your own runtime. For general cases, use strip and upx. (For example, with upx --best I was able to reduce the binary size of your example to 50K, without any further tricks.) If performance doesn't matter that much, then you can use bytecode, which is usually smaller than machine code. Thus you pay once (about 200K for the runtime) and a few bytes for each program (e.g., 200 bytes for your helloworld).
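A minimal sketch of that strip-and-compress route, assuming upx is installed (exact savings vary per system):

$ strip helloworld        # drop symbol tables and debug info
$ upx --best helloworld   # compress the stripped binary in place
$ du -h helloworld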
Also, do not create many small binaries; create one binary. In your particular example, the size of the helloworld compilation unit is 200 bytes in bytecode and 700 bytes in machine code. The remaining 50K is the startup harness, which needs to be included only once. Moreover, since OCaml supports dynamic linking at runtime, you can easily create a loader that will load modules when needed. In that scenario, the binaries become very small (hundreds of bytes).
It is as if what ocamlopt calls "native compilation" amounts to packing ocamlrun and the non-native bytecode of your program into one file and making it executable. Not exactly what I would have expected. I've obviously missed some important point. But if that is the case, I'd be interested to know why it isn't as I expected.
No, no, that is completely wrong. Native compilation is when a program is compiled to machine code, whether x86, ARM, or whatever. The runtime is written in C, compiled to machine code, and is also linked in. The OCaml standard library is written mostly in OCaml, also compiled to machine code, and also linked into the binary (only the modules that are actually used; OCaml static linking is very efficient, provided that the program is split into modules (compilation units) reasonably well).
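You can check this yourself; a quick sketch (this assumes an unstripped build, i.e. without -ccopt -s, and a 4.02-era symbol naming scheme):

$ file helloworld     # reports a statically linked native ELF executable
$ objdump -d helloworld | grep camlHelloworld   # your module's actual machine code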
Concerning OCAMLRUNPARAM, it is just an environment variable that parameterizes the behavior of the runtime, mostly the parameters of the garbage collector.
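For instance, a small sketch (the v=0x400 flag asks the runtime to print GC statistics at exit; the Gc module documentation lists the other flags):

$ OCAMLRUNPARAM='v=0x400' ./helloworld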
I have a few programs (written in C) implementing some algorithms, which I use to measure computation time. All the data is embedded statically in the code; there is no input or output in these programs. There are also no C library calls (no printfs, etc.).
I want to build a fully independent and minimal executable. I don't want to link my program with libgcc (the target CPU has a coprocessor, so I don't need to emulate floating-point arithmetic), the C library, or anything else. Actually, I want to make my program as independent as possible. On Linux, an ELF program has to be linked only with crt0.o to run properly, right?
I'm mostly asking because I'm curious ;)
Link with gcc -nostdlib, then use objdump -h and strip --remove-section=... to really make it small by getting rid of silly things like the comment section and the exception handling frame information sections. Keep removing sections until it stops working.
And compile with -Os, of course.
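A minimal sketch of such a freestanding program, assuming x86-64 Linux (where the exit system call number is 60); with no libc there is no main wrapper, so _start must terminate the process itself:

/* minimal.c — no libc, no libgcc, no startup files */
void _start(void) {
    /* exit(0) via the raw Linux system call, since there is no libc exit() */
    __asm__ volatile (
        "mov $60, %%rax\n\t"   /* SYS_exit on x86-64 */
        "xor %%rdi, %%rdi\n\t" /* exit status 0 */
        "syscall"
        : : : "rax", "rdi");
}

Build and shrink it roughly as described above:

$ gcc -Os -nostdlib -static -o minimal minimal.c
$ strip --remove-section=.comment minimal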
I have a C++ program which is compiled under gcc (gcc version 4.5.1) with the -O3 flag. I'm thinking about whether or not it would be worthwhile to make an SSE2 version of this program (or at least of the busiest parts of it). However, I'm worried that the compiler has already done this through automatic vectorization.
Question: How do I determine (a) whether or not my program is using SSE/SSE2 and (b) how much time is spent using SSE/SSE2 (i.e. profiling)?
The easiest way to tell if you are gaining any benefit from compiler vectorization is to run the code with and without the -ftree-vectorize flag and compare the results.
-O3 will automatically enable that option. So you might want to try it under -O2 instead.
To see which loops were vectorized, which were not, and why, you can add the -ftree-vectorizer-verbose option.
The last option, of course, is to look at the assembly. It's very easy to identify vectorized code in assembly.
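For example, here is the kind of loop the vectorizer typically handles; a minimal sketch (-ftree-vectorizer-verbose fits the GCC 4.x era of the question; newer GCC replaces it with -fopt-info-vec):

/* vec.c — a loop GCC can usually auto-vectorize with SSE */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];    /* compiles to packed addps when vectorized */
}

$ gcc -std=c99 -O3 -ftree-vectorizer-verbose=2 -S vec.c

Then look in vec.s for packed SSE instructions such as addps or movaps; their scalar counterparts (addss, movss) indicate code that was not vectorized.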
If I include <stdlib.h> or <stdio.h> in a C program, I don't have to link them when compiling, but if I use <math.h> I do have to link the math library, using -lm with GCC, for example:
gcc test.c -o test -lm
What is the reason for this? Why do I have to explicitly link the math library, but not the other libraries?
The functions in stdlib.h and stdio.h have implementations in libc.so (or libc.a for static linking), which is linked into your executable by default (as if -lc were specified). GCC can be instructed to avoid this automatic link with the -nostdlib or -nodefaultlibs options.
The math functions in math.h have implementations in libm.so (or libm.a for static linking), and libm is not linked in by default. There are historical reasons for this libm/libc split, none of them very convincing.
Interestingly, the C++ runtime libstdc++ requires libm, so if you compile a C++ program with GCC (g++), you will automatically get libm linked in.
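A quick way to see this default, sketched here (note that on glibc 2.34 and later, libm has been merged into libc, so the difference is only visible on older systems):

$ echo 'int main(void){return 0;}' > empty.c
$ gcc empty.c -o empty && ldd empty   # libc.so.6 is listed; no libm
$ g++ empty.c -o empty && ldd empty   # libstdc++.so.6 appears and pulls in libm.so.6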
Remember that C is an old language and that FPUs are a relatively recent phenomenon. I first saw C on 8-bit processors where it was a lot of work to do even 32-bit integer arithmetic. Many of these implementations didn't even have a floating point math library available!
Even on the first 68000 machines (Mac, Atari ST, Amiga), floating point coprocessors were often expensive add-ons.
To do all that floating point math, you needed a pretty sizable library. And the math was going to be slow. So you rarely used floats. You tried to do everything with integers or scaled integers. When you had to include math.h, you gritted your teeth. Often, you'd write your own approximations and lookup tables to avoid it.
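A minimal sketch of the scaled-integer (fixed-point) trick, for the curious (the 16.16 format here is just one common choice; the printf with %f is only for display on a modern host):

/* fixed.c — 16.16 fixed-point arithmetic, avoiding floats (and libm) entirely */
#include <stdio.h>
#include <stdint.h>

typedef int32_t fix16;                      /* 16 integer bits, 16 fraction bits */
#define TO_FIX(x)    ((fix16)((x) * 65536))
#define TO_DOUBLE(f) ((double)(f) / 65536.0)

static fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16); /* widen to 64 bits to avoid overflow */
}

int main(void) {
    fix16 x = TO_FIX(1.5), y = TO_FIX(2.25);
    printf("1.5 * 2.25 = %f\n", TO_DOUBLE(fix_mul(x, y))); /* prints 3.375 */
    return 0;
}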
Trade-offs existed for a long time. Sometimes there were competing math packages called "fastmath" or such. What's the best solution for math? Really accurate but slow stuff? Inaccurate but fast? Big tables for trig functions? It wasn't until coprocessors were guaranteed to be in the computer that most implementations became obvious. I imagine that there's some programmer out there somewhere right now, working on an embedded chip, trying to decide whether to bring in the math library to handle some math problem.
That's why math wasn't standard. Many or maybe most programs didn't use a single float. If FPUs had always been around and floats and doubles were always cheap to operate on, no doubt there would have been a "stdmath".
Because of ridiculous historical practice that nobody is willing to fix. Consolidating all of the functions required by C and POSIX into a single library file would not only avoid this question getting asked over and over, but would also save a significant amount of time and memory when dynamic linking, since each .so file linked requires the filesystem operations to locate and find it, and a few pages for its static variables, relocations, etc.
An implementation where all functions are in one library and the -lm, -lpthread, -lrt, etc. options are all no-ops (or link to empty .a files) is perfectly POSIX conformant and certainly preferable.
Note: I'm talking about POSIX because C itself does not specify anything about how the compiler is invoked. Thus you can just treat gcc -std=c99 -lm as the implementation-specific way the compiler must be invoked for conformant behavior.
Because time() and some other functions are defined in the C library (libc) itself, and GCC always links to libc unless you use the -ffreestanding compile option. However, math functions live in libm, which is not implicitly linked by gcc.
An explanation is given here:
So if your program is using math functions and including math.h, then you need to explicitly link the math library by passing the -lm flag. The reason for this particular separation is that mathematicians are very picky about the way their math is being computed and they may want to use their own implementation of the math functions instead of the standard implementation. If the math functions were lumped into libc.a it wouldn't be possible to do that.
[Edit]
I'm not sure I agree with this, though. If you have a library which provides, say, sqrt(), and you pass it before the standard library, a Unix linker will take your version, right?
There's a thorough discussion of linking to external libraries in An Introduction to GCC - Linking with external libraries. If a library is a member of the standard libraries (like stdio), then you don't need to specify to the compiler (really the linker) to link them.
After reading some of the other answers and comments, I think the libc.a reference and the libm reference that it links to both have a lot to say about why the two are separate.
Note that many of the functions in libm.a (the math library) are defined in math.h but are not present in libc.a. Some are, which may get confusing, but the rule of thumb is this: the C library contains those functions that ANSI dictates must exist, so that you don't need the -lm if you only use ANSI functions. In contrast, libm.a contains more functions and supports additional functionality such as the matherr call-back and compliance to several alternative standards of behavior in case of FP errors. See section libm for more details.
As ephemient said, the C library libc is linked by default, and this library contains the implementations behind stdlib.h, stdio.h, and several other standard headers. Just to add to it, according to "An Introduction to GCC" the linker command for a basic "Hello World" program in C is as below:
ld -dynamic-linker /lib/ld-linux.so.2 /usr/lib/crt1.o
/usr/lib/crti.o /usr/lib/gcc-lib/i686/3.3.1/crtbegin.o
-L/usr/lib/gcc-lib/i686/3.3.1 hello.o -lgcc -lgcc_eh -lc
-lgcc -lgcc_eh /usr/lib/gcc-lib/i686/3.3.1/crtend.o /usr/lib/crtn.o
Notice the option -lc in the third line that links the C library.
If I include stdlib.h or stdio.h, I don't have to link those, but I do have to link when I compile:
stdlib.h and stdio.h are header files. You include them for your convenience. They only declare what symbols will become available if you link in the proper library. The implementations are in the library files; that's where the functions really live.
Including math.h is only the first step to gaining access to all the math functions.
Also, you don't have to link against libm if you don't use its functions, even if you do a #include <math.h>, which is only an informational step for the compiler about the symbols.
stdlib.h, stdio.h refer to functions available in libc, which happens to be always linked in so that the user doesn't have to do it himself.
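A tiny sketch of the point about math.h being only declarations:

$ printf '#include <math.h>\nint main(void){return 0;}\n' > t.c
$ gcc t.c -o t    # links fine without -lm; including the header alone costs nothing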
It's a bug. You shouldn't have to explicitly specify -lm any more. Perhaps if enough people complain about it, it will be fixed. (I don't seriously believe this, as the maintainers who are perpetuating the distinction are evidently very stubborn, but I can hope.)
I think it's kind of arbitrary. You have to draw a line somewhere (which libraries are default and which need to be specified).
It gives you the opportunity to replace it with a different one that has the same functions, but I don't think it's very common to do so.
I think GCC does this to maintain backwards compatibility with the original cc executable. My guess for why cc does this is because of build time -- cc was written for machines with far less power than we have now. A lot of programs don't have any floating-point math, and they probably took every library that wasn't commonly used out of the default. I'm guessing that the build time of the Unix OS and the tools that go along with it were the driving force.
I would guess that it is a way to make applications which don't use it at all perform slightly better. Here's my thinking on this.
x86 OSes (and I imagine others) need to store FPU state on context switch. However, most OSes only bother to save/restore this state after the app attempts to use the FPU for the first time.
In addition to this, there is probably some basic code in the math library which will set the FPU to a sane base state when the library is loaded.
So, if you don't link in any math code at all, none of this will happen, therefore the OS doesn't have to save/restore any FPU state at all, making context switches slightly more efficient.
Just a guess though.
The same basic premise still applies to non-FPU cases (the premise being that it was to make apps which didn't make use of libm perform slightly better).
For example, consider a soft-FPU, which was likely in the early days of C. Having libm separate could then prevent a lot of large (and slow, if it was used) code from being linked in unnecessarily.
In addition, if there is only static linking available, then a similar argument applies that it would keep executable sizes and compile times down.
stdio is part of the standard C library which, by default, GCC will link against.
The math function implementations are in a separate libm file that is not linked by default, so you have to specify it with -lm. By the way, there is no inherent relation between those header files and library files.
Headers like stdio.h and stdlib.h have their implementations in libc.so or libc.a, which the linker links in by default, so their code ends up in the executable without any extra flags.
But math.h has its implementation in libm.so or libm.a, which is separate from libc and does not get linked by default; you have to link it manually when compiling your program with GCC, using the -lm flag.
The GCC developers simply chose to keep libm separate: the libraries behind the other standard headers are linked by default, but the one behind math.h is not.
Here, read item number 14.3 (you can read it all if you wish):
Reason why math.h needs to be linked
Look at this article: Why do we have to link math.h in GCC?
Have a look at the usage:
Using the library
Note that -lm may not always need to be specified even if you use some C math functions.
For example, the following simple program:
#include <stdio.h>
#include <math.h>

int main() {
    printf("output: %f\n", sqrt(2.0));
    return 0;
}
can be compiled and run successfully with the following command:
gcc test.c -o test
It was tested on GCC 7.5.0 (on Ubuntu 16.04) and GCC 4.8.0 (on CentOS 7).
The post here gives some explanations:
The math functions you call are implemented by compiler built-in functions
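To see the difference, here is a sketch where the argument is not a compile-time constant, so GCC cannot fold the call into its built-in (the file name test2.c is illustrative, and the exact linker message varies; on systems where libm is separate the first command fails):

/* test2.c */
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv) {
    /* argc is unknown at compile time, so sqrt() cannot be constant-folded */
    printf("output: %f\n", sqrt((double)argc));
    return 0;
}

$ gcc test2.c -o test2        # may fail: undefined reference to `sqrt'
$ gcc test2.c -o test2 -lm    # works: libm linked explicitly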
See also:
Other Built-in Functions Provided by GCC
How to get the gcc compiler to not optimize a standard library function call like printf?