How to get the smallest ocamlopt-compiled native binary?

I was quite surprised to see that even a simple program like:
print_string "Hello world !\n";
when statically compiled to native code through ocamlopt with some fairly aggressive options (using musl), still comes out at around ~190KB on my system:
$ ocamlopt.opt -compact -verbose -o helloworld \
-ccopt -static \
-ccopt -s \
-ccopt -ffunction-sections \
-ccopt -fdata-sections \
-ccopt -Wl \
-ccopt -gc-sections \
-ccopt -fno-stack-protector \
helloworld.ml && { ./helloworld ; du -h helloworld; }
+ as -o 'helloworld.o' '/tmp/camlasm759655.s'
+ as -o '/tmp/camlstartupfc4271.o' '/tmp/camlstartup5a7610.s'
+ musl-gcc -Os -o 'helloworld' '-L/home/vaab/.opam/4.02.3+musl+static/lib/ocaml' -static -s -ffunction-sections -fdata-sections -Wl -gc-sections -fno-stack-protector '/tmp/camlstartupfc4271.o' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/std_exit.o' 'helloworld.o' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/stdlib.a' '/home/vaab/.opam/4.02.3+musl+static/lib/ocaml/libasmrun.a' -static -lm
Hello world !
196K helloworld
How to get the smallest binary from ocamlopt?
A size of 190KB is far too much for such a simple program under today's constraints (IoT, Android, Alpine VMs...), and compares badly with a simple C program (around ~6KB, or around 150B when directly coding assembly and tweaking things to get a working binary). I naïvely thought that I could simply ditch C, write simple static programs that do trivial things, and end up after compilation with simple assembly code that wouldn't be far in size from the equivalent C program. Is that possible?
What I think I understand:
When removing gcc's -s to get some hints about what is left in the binary, I notice a lot of OCaml symbols, and I have also read that some ocamlrun environment variables are interpreted even in this form. It is as if what ocamlopt calls "native compilation" meant packing ocamlrun and the non-native bytecode of your program into one file and making it executable. Not exactly what I would have expected. I obviously missed some important point. But if that is the case, I'll be interested to know why it isn't as I expected.
Other languages that compile to native code have the same issue, leaving naïve users (like myself) with roughly the same questions:
Go: Reason for huge size of compiled executable of Go
Rust: Why are Rust executables so huge?
I have also tested Haskell, and without tweaks all of these compilers produce binaries above 700KB for the "hello world" program (it was the same for OCaml before the tweaks).

Your question is very broad and I'm not sure it fits the Stack Overflow format. It deserves a thorough discussion.
A size of 190KB is far too much for such a simple program under today's constraints (IoT, Android, Alpine VMs...), and compares badly with a simple C program (around ~6KB, or around 150B when directly coding assembly and tweaking things to get a working binary)
First of all, this is not a fair comparison. Nowadays, a compiled C binary is an artifact that is far from standalone; it is better seen as a plugin in a framework. So if you want to count how many bytes a given binary actually uses, you should count the loader, the shell, the libc library, and the whole Linux or Windows kernel, which together form the runtime of the application.
OCaml, unlike Java or Common Lisp, is very friendly to the common C runtime and tries to reuse most of its facilities. But OCaml still comes with its own runtime, whose biggest (and most important) part is the garbage collector. The runtime is not extremely big (about 30 KLOC) but it still contributes to the weight, and since OCaml uses static linking, every OCaml program carries a copy of it.
Therefore, C binaries have a significant advantage, as they are usually run on systems where the C runtime is already available (so it is usually excluded from the equation). There are, however, systems with no C runtime at all, where only the OCaml runtime is present; see Mirage, for example. On such systems, OCaml binaries are much more favorable. Another example is the OCaPic project, in which (after tweaking the compiler and runtime) they managed to fit the OCaml runtime and programs into 64 KB of flash (read the paper, it is very insightful about binary sizes).
How to get the smallest binary from ocamlopt?
When it is really necessary to minimize the size, use Mirage unikernels or implement your own runtime. For general cases, use strip and upx. (For example, with upx --best I was able to reduce the binary size of your example to 50K, without any further tricks.) If performance doesn't matter that much, you can use bytecode, which is usually smaller than machine code; you then pay once (about 200k for the runtime) and a few bytes for each program (e.g., 200 bytes for your helloworld).
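For instance, a quick sketch of that post-processing on the question's binary (strip and upx are stock tools; upx --best is where the 50K figure above comes from):
$ strip helloworld        # drop symbol tables and debug sections
$ upx --best helloworld   # compress the executable in place
$ du -h helloworld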
Also, do not create many small binaries; create one binary. In your particular example, the helloworld compilation unit is 200 bytes as bytecode and 700 bytes as machine code; the remaining ~50k is the startup harness, which need be included only once. Moreover, since OCaml supports dynamic linking at runtime, you can easily create a loader that loads modules on demand, and in that scenario the binaries become very small (hundreds of bytes), as the build sketch below shows.
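As a minimal sketch of that loader scenario (plugin.ml and loader.ml are hypothetical: the plugin does its work in its top-level code, and loader.ml merely calls Dynlink.loadfile "plugin.cmxs"):
$ ocamlopt -shared -o plugin.cmxs plugin.ml   # a tiny native plugin for Dynlink
$ ocamlopt dynlink.cmxa loader.ml -o loader   # the startup harness is paid once, here
$ ./loader                                    # runs the plugin's top-level code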
It is as if what ocamlopt calls "native compilation" meant packing ocamlrun and the non-native bytecode of your program into one file and making it executable. Not exactly what I would have expected. I obviously missed some important point. But if that is the case, I'll be interested to know why it isn't as I expected.
No, that is completely wrong. Native compilation means the program is compiled to machine code, whether x86, ARM, or whatever. The runtime is written in C, compiled to machine code, and linked in as well. The OCaml standard library is written mostly in OCaml, also compiled to machine code, and also linked into the binary (only those modules that are used: OCaml static linking is very efficient, provided that the program is split into modules (compilation units) reasonably well).
Concerning the OCAMLRUNPARAM environment variable: it is just an environment variable that parameterizes the behavior of the runtime, mostly the parameters of the garbage collector.
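For example (these are documented OCAMLRUNPARAM flags: b turns on backtrace recording, and v=0x400 makes the GC print its statistics when the program exits):
$ OCAMLRUNPARAM=b ./helloworld
$ OCAMLRUNPARAM='v=0x400' ./helloworld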

Related

Do all compiled codes have same speed no matter what language they were written in?

Suppose I write a program in both Python and C++ and turn each into an executable. Will both executables have the same speed, or will it vary? (I guess it shouldn't, since both should now be in machine-code form.)
Suppose I write a program in both Python and C++ and turn each into an executable. Will both executables have the same speed?
Of course not, usually (assuming both implement the same algorithm). The runtime speed depends a lot on the compiler itself (e.g. tinycc for C, versus GCC or Clang...), and even on its version and compilation flags (e.g. -Os vs -O2 with g++). BTW, Python is compiled to some bytecode, not to machine code.
Of course, some software spends most of its CPU time elsewhere (e.g. in some relational database manager such as PostgreSQL); rewriting it in C++ instead of Python won't gain much performance then. And some software is mostly IO-bound (e.g. tar(1) used without compression).
Finally, some C++ programs can generate machine code at runtime (e.g. using AsmJit...) through partial-evaluation techniques, which may give a huge speedup.
On Linux, you could generate some C or C++ code at runtime, compile it as a temporary plugin, then dlopen(3) that temporary plugin, fetching the new function pointers with dlsym(3) (adapt the manydl.c example to your needs), as sketched below.
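As a minimal, hypothetical sketch of that generate-compile-dlopen cycle (the file names /tmp/gen.c and /tmp/gen.so and the function gen_add are invented for illustration; build the host program with gcc main.c -ldl):
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* 1. emit specialized C code at runtime */
    FILE *f = fopen("/tmp/gen.c", "w");
    if (!f) return 1;
    fprintf(f, "int gen_add(int x) { return x + 42; }\n");
    fclose(f);

    /* 2. compile it into a temporary plugin */
    if (system("gcc -O2 -fPIC -shared /tmp/gen.c -o /tmp/gen.so") != 0)
        return 1;

    /* 3. load the plugin and fetch the freshly generated function pointer */
    void *h = dlopen("/tmp/gen.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    int (*gen_add)(int) = (int (*)(int)) dlsym(h, "gen_add");
    printf("%d\n", gen_add(1));   /* prints 43 */
    dlclose(h);
    return 0;
}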
Also, C++ is a very difficult language to learn; read some good book about it.
And of course read the Dragon book, since an entire book is needed to answer your question!

identify whether an ELF binary is built with optimizations

I know we can use cmake or make to control CFLAGS or CXXFLAGS when building a release version manually. A release version to me means that at least -O2 or -O3 is given (it doesn't matter whether -g is given or strip is performed), whereas a debug version is given -O0.
However, sometimes I want to use scripts to decide whether an ELF binary was built with optimizations, so I can decide what to do next. I tried objdump, file and readelf, but found no answers. Are there any alternatives?
However, sometimes I want to use scripts to decide whether an ELF binary was built with optimizations, so I can decide what to do next.
The problem with your question is that your binary is very likely to contain some code built with optimizations (e.g. crt0.o, which is part of glibc).
Your final binary is composed of a bunch of object files, and these files do not have to have consistent optimization flags.
Most likely you only care about the code you wrote (as opposed to code linked in from other libraries). You could use -frecord-gcc-switches for that (see this answer), as shown below.
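A short sketch of what that looks like in practice (hypothetical file names; GCC records the flags in a .GCC.command.line section, and readelf -p prints a string section):
$ gcc -O2 -frecord-gcc-switches -o prog prog.c
$ readelf -p .GCC.command.line prog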

GCC optimization levels. Which is better?

I am focusing on the CPU/memory consumption of programs compiled with GCC.
Is executing code compiled with -O3 always so greedy in terms of resources?
Is there any scientific reference or specification that shows the difference in memory/CPU consumption between the different levels?
People working on this problem often focus on the impact of these optimizations on execution time, compiled code size, and energy. However, I can't find much work on resource consumption (with optimizations enabled).
Thanks in advance.
No, there is no absolute answer, because optimization in compilers is an art (it is not even well defined, and might be undecidable or intractable).
But some guidelines first:
be sure that your program is correct and has no bugs before optimizing anything, so do debug and test your program
have well designed test cases and representative benchmarks (see this).
be sure that your program has no undefined behavior (and this is tricky, see this), since GCC will optimize strangely (but very often correctly, according to the C99 or C11 standards) if you have UB in your code; use the -fsanitize= family of options (and gdb and valgrind...) during the debugging phase.
profile your code (on various benchmarks), in particular to find out what parts are worth optimization efforts; often (but not always) most of the CPU time happens in a small fraction of the code (rule of thumb: 80% of time spent in 20% of code; on some applications like the gcc compiler this is not true, check with gcc -ftime-report to ask gcc to show time spent in various compiler modules).... Most of the time "premature optimization is the root of all evil" (but there are exceptions to this aphorism).
improve your source code (e.g. use carefully and correctly restrict and const, add some pragmas or function or variable attributes, perhaps use wisely some GCC builtins __builtin_expect, __builtin_prefetch -see this-, __builtin_unreachable...)
use a recent compiler. The current version (October 2015) of GCC is 5.2 (and GCC 8 in June 2018), and continuous progress on optimization is made; you might consider compiling GCC from its source code to have a recent version.
enable all warnings (gcc -Wall -Wextra) in the compiler, and try hard to avoid all of them; some warnings may appear only when you ask for optimization (e.g. with -O2)
Usually, compile with -O2 -march=native (or perhaps -mtune=native; I assume you are not cross-compiling, and if you are, add the right -march option...) and benchmark your program with that
Consider link-time optimization by compiling and linking with -flto and the same optimization flags. E.g., put CC=gcc -flto -O2 -march=native in your Makefile (then remove -O2 -march=native from your CFLAGS there)... A consolidated sketch follows this list.
Try also -O3 -march=native; usually (but not always: you might sometimes get slightly faster code with -O2 than with -O3, but this is uncommon) you get a tiny improvement over -O2
If you want to optimize the generated program size, use -Os instead of -O2 or -O3; more generally, don't forget to read the section Options That Control Optimization of the documentation. I guess that both -O2 and -Os would optimize the stack usage (which is very related to memory consumption). And some GCC optimizations are able to avoid malloc (which is related to heap memory consumption).
you might consider profile-guided optimizations, -fprofile-generate, -fprofile-use, -fauto-profile options
dive into the documentation of GCC, it has numerous optimization & code generation arguments (e.g. -ffast-math, -Ofast ...) and parameters and you could spend months trying some more of them; beware that some of them are not strictly C standard conforming!
recent GCC and Clang can emit DWARF debug information (somehow "approximate" if strong optimizations have been applied) even when optimizing, so passing both -O2 and -g could be worthwhile (you still would be able, with some pain, to use the gdb debugger on optimized executable)
if you have a lot of time to spend (weeks or months), you might customize GCC using MELT (or some other plugin) to add your own new (application-specific) optimization passes; but this is difficult (you'll need to understand GCC internal representations and organization) and probably rarely worthwhile, except in very specific cases (those when you can justify spending months of your time for improving optimization)
you might want to understand the stack usage of your program, so use -fstack-usage
you might want to understand the emitted assembler code: use -S -fverbose-asm in addition to the optimization flags (and look into the produced .s assembler file)
you might want to understand the internal workings of GCC: use the various -fdump-* flags (you'll get hundreds of dump files!).
Of course, the above to-do list should be used in an iterative and agile fashion.
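As a consolidated sketch of the advice above (hypothetical file name prog.c; tune the flags against your own benchmarks):
$ gcc -Wall -Wextra -O2 -march=native -flto prog.c -o prog   # general release build
$ gcc -Wall -Wextra -Os -flto prog.c -o prog                 # when binary size matters more than speed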
For memory leak bugs, consider valgrind and the several -fsanitize= debugging options. Read also about garbage collection (and the GC handbook), notably Boehm's conservative garbage collector, and about compile-time garbage-collection techniques.
Read about the MILEPOST project in GCC.
Consider also OpenMP, OpenCL, MPI, multi-threading, etc... Notice that parallelization is a difficult art.
Notice that even GCC developers are often unable to predict the effect (on CPU time of the produced binary) of such and such optimization. Somehow optimization is a black art.
Perhaps gcc-help@gcc.gnu.org might be a good place to ask more specific, precise, and focused questions about optimizations in GCC.
You could also contact me on basileatstarynkevitchdotnet with a more focused question... (and mention the URL of your original question)
For scientific papers on optimizations, you'll find lots of them. Start with ACM TOPLAS, ACM TACO, etc. Search for iterative compiler optimization, etc., and define better what resources you want to optimize for (memory consumption alone means next to nothing).

Does a compiler always produce an assembly code?

From Thinking in C++ - Vol 1:
In the second pass, the code generator walks through the parse tree and generates either assembly language code or machine code for the nodes of the tree.
Well, at least in GCC, if we give the option to generate assembly code, the compiler obeys by creating a file containing assembly code. But when we simply run gcc without any options, does it not produce the assembly code internally?
If yes, then why does it need to first produce assembly code and only then translate it to machine language?
TL;DR: different object-file formats, and (historically) easier portability to new Unix platforms, are among the main reasons gcc keeps the assembler separate from the compiler, I think. Outside of gcc, the mainstream x86 C and C++ compilers (clang/LLVM, MSVC, ICC) go straight to machine code, with the option of printing asm text if you ask them to.
LLVM and MSVC are (or come with) complete toolchains, not just compilers; they also come with an assembler and linker. LLVM already has object-file handling as a library, so it can use that instead of writing out asm text to feed to a separate program.
Smaller projects often choose to leave object-file format details to the assembler; e.g. FreePascal can go straight to an object file on a few of its target platforms, but otherwise emits only asm. There are many claims (1, 2, 3, 4) that almost all compilers go through asm text, but that's not true for many of the biggest, most widely used compilers (except GCC) that have lots of developers working on them.
C compilers tend to either target a single platform only (like a vendor's compiler for a microcontroller) and were written as "the/a C implementation for this platform", or be very large projects like LLVM where including machine code generation isn't a big fraction of the compiler's own code size. Compilers for less widely used languages are more usually portable, but without wanting to write their own machine-code / object-file handling. (Many compilers these days are front-ends for LLVM, so get .o output for free, like rustc, but older compilers didn't have that option.)
Out of all compilers ever written, most do go through asm. But if you weight them by how often each one is used every day, going straight to a relocatable object file (.o / .obj) accounts for a significant fraction of the total builds done on any given day worldwide. I.e., the compiler you care about, if you're reading this, might well work this way.
Also, compilers like javac that target a portable bytecode format have less reason to use asm; the same output file and bytecode format work across every platform they have to run on.
Related:
https://retrocomputing.stackexchange.com/questions/14927/when-and-why-did-high-level-language-compilers-start-targeting-assembly-language on retrocomputing has some other answers about advantages of keeping as separate.
What is the need to generate ASM code in gcc, g++
What do C and Assembler actually compile to? - even compilers that go straight to machine code don't produce linked executables directly, they produce relocatable object files (.o or .obj). Except for tcc, the Tiny C Compiler, intended for use on the fly for one-file C programs.
Semi-related: Why do we even need assembler when we have compiler? asm is useful for humans to look at machine code, not as a necessary part of C -> machine code.
Why GCC does what it does
Yes, as is a separate program that the gcc front-end actually runs separately from cc1 (the C preprocessor+compiler that produces text asm).
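You can see this for yourself (hypothetical file name hello.c; -v makes the gcc driver print the cc1 and as command lines it spawns):
$ gcc -v -c hello.c   # shows the cc1 invocation (C -> text asm) and the as invocation (asm -> hello.o)
$ gcc -S hello.c      # stop after cc1, leaving the text asm in hello.s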
This makes gcc slightly more modular, making the compiler itself a text -> text program.
GCC internally uses some binary data structures for GIMPLE and RTL internal representations, but it doesn't write (text representations of) those IR formats to files unless you use a special option for debugging.
So why stop at assembly? This means GCC doesn't need to know about different object-file formats for the same target. For example, different x86-64 OSes use ELF, PE/COFF, or Mach-O object files, and historically a.out. as assembles the same text asm into the same machine code, surrounded by different object-file metadata on different targets. (There are minor differences gcc has to know about, like whether to prepend an _ to symbol names, whether 32-bit absolute addresses can be used, and whether code has to be PIC.)
Any platform-specific quirks can be left to GNU binutils as (aka GAS), or gcc can use the vendor-supplied assembler that comes with a system.
Historically, there were many different Unix systems with different CPUs, or especially the same CPU but different quirks in their object-file formats. More importantly, they shared a fairly compatible set of assembler directives, like .globl main, .asciiz "Hello World!\n", and similar; GAS syntax comes from Unix assemblers.
It really was possible in the past to port GCC to a new Unix platform without porting as, just using the assembler that comes with the OS.
Nobody has ever gotten around to integrating an assembler as a library into GCC's cc1 compiler. That's been done for the C preprocessor (which historically was also done in a separate process), but not the assembler.
Most other compilers do produce object files directly from the compiler, without a text asm temporary file / pipe. Often because the compiler was only designed for one or a couple targets, like MSVC or ICC or various compilers that started out as x86-only, or many vendor-supplied compilers for embedded chips.
clang/LLVM was designed much more recently than GCC. It was designed to work as an optimizing JIT back-end, so it needed a built-in assembler to make it fast to generate machine code. To work as an ahead-of-time compiler, adding support for different object-file formats was presumably a minor thing since the internal software architecture was there to go straight to binary machine code.
LLVM of course uses LLVM-IR internally for target-independent optimizations before looking for back-end-specific optimizations, but again it only writes out this format as text if you ask it to.
The assembler stage can be justified by two reasons:
it allows C/C++ code to be translated to a machine-independent abstract assembler, from which there exist straightforward conversions to a multitude of different instruction set architectures
it lifts from the compiler the burden of validating correct opcode, prefix, r/m, etc. instruction encodings for CISC architectures, since one can reuse an existing software component
The 1st edition of that book is from 2000, but it may as well be talking about the early '90s, when C++ itself was translated to C and when the GNU/free-software idea (including source code for compilers) was not really well known.
EDIT: One of several nonsensical abstract machine independent languages used by GCC is RTL -- Register Transfer Language.
It's a matter of compiler implementation. Assembly code is an intermediate step between the higher-level language (the one being compiled) and the resulting binary output. In general, it's easier to convert to assembly first and then to binary code, rather than creating the binary code directly.
GCC does create the assembly code as a temporary file, calls the assembler, and maybe the linker, depending on what you do or don't add on the command line. That makes an object file, and then, if enabled, the binary; all the temporary files are then cleaned up. Use -save-temps to see what is really going on (there are a number of temporary files).
Running gcc without any options absolutely creates an asm file.
There is no "need" for this; it is simply how they happened to design it. I assume there are multiple reasons: you will already want/need an assembler and linker before you start on a compiler (cart before the horse: asm on a processor comes before some other language). "The Unix way" is not to re-invent tools or libraries, but to add a little on top, which implies going to asm and then letting the assembler and linker do the rest; you don't have to re-invent so much of the assembler's job that way (multiple passes, resolving labels, etc.). It is also easier for a developer to debug ASCII asm than bits. Folks have been doing it this way for generations of compilers. Just-in-time compilers are the primary exception to this habit: by definition they have to be able to go straight to machine code, so they do or can. Only recently did llvm provide a way for the command-line tools (llc) to go straight to an object file without stopping at asm (or at least it appears that way to the user).
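For example (assuming a source file hello.c; -save-temps keeps the intermediate files instead of deleting them):
$ gcc -save-temps -c hello.c   # keeps hello.i (preprocessed), hello.s (text asm), hello.o (object)
$ ls hello.*
hello.c  hello.i  hello.o  hello.s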

Building minimal standalone executable with GCC

I have a few programs (written in C) implementing some algorithms that I use to measure computation time. All the data is compiled statically directly into the code; there is no input or output in these programs. There are also no C library calls (no printfs etc.).
I want to build a fully independent and minimal executable. I don't want to link my program with libgcc (the target CPU has a coprocessor, so I don't need to emulate floating-point arithmetic), with the C library, or with anything else. Actually, I want to make my program as independent as possible. On Linux, an ELF program has to be linked only with crt0.o to run properly, right?
I'm mostly asking because I'm curious ;)
Link with gcc -nostdlib, then use objdump -h and strip --remove-section=... to really make it small by getting rid of silly things like the comment section and the exception handling frame information sections. Keep removing sections until it stops working.
And compile with -Os, of course.
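As a hedged sketch of how far this can go on x86-64 Linux (no libc, no libgcc, no crt0; _start is the raw ELF entry point, and the program exits via a raw syscall since there is no caller to return to):
/* minimal.c */
void _start(void) {
    /* SYS_exit is 60 on x86-64 Linux; exit status = 0 */
    __asm__ volatile ("movq $60, %rax\n\t"
                      "xorq %rdi, %rdi\n\t"
                      "syscall");
}
$ gcc -nostdlib -static -Os -o minimal minimal.c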
