Compiling with g++ fails with "virtual memory exhausted: Cannot allocate memory"

I'm compiling some software (Node.js, in this case) on a system with very limited memory (around 800 MB usable), and the compilation fails partway through when it hits this limit, with the error message virtual memory exhausted: Cannot allocate memory.
Upgrading the system's memory is not an option, and I just need to be able to compile this software once on it.

I found a solution, initially mentioned on the Debian wiki, that let me finish the compilation: passing g++ the flag --param ggc-min-expand=10 reduces its memory use by forcing GCC's garbage collector to run more often, as documented in the GCC optimization docs.
Before re-running make, simply run
export CXXFLAGS="--param ggc-min-expand=10" (or export CXXFLAGS="$CXXFLAGS --param ggc-min-expand=10" to preserve any options you have already set in CXXFLAGS) so that the parameter is applied to every g++ invocation during the build.
You can set ggc-min-expand even lower than 10 if needed, at the cost of slower compilation; that wasn't necessary in my case.
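A minimal sketch of the whole sequence, assuming a configure/make style build like the one Node.js uses (add CFLAGS too if the project also compiles plain C sources):

    export CXXFLAGS="$CXXFLAGS --param ggc-min-expand=10"
    export CFLAGS="$CFLAGS --param ggc-min-expand=10"   # only if C sources are built as well
    ./configure    # re-run configure if the build system caches compiler flags
    make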

Related

How does "--print-memory-usage" work in GCC?

How does GCC provide a breakdown of the memory used in each memory region defined in the linker script when --print-memory-usage is used?
GCC just forwards --print-memory-usage to the linker, usually ld:
https://sourceware.org/binutils/docs-2.40/ld.html#index-memory-usage
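For example, on a bare-metal target you would typically pass the flag through the GCC driver with -Wl,; the toolchain name, linker script and numbers below are placeholders:

    arm-none-eabi-gcc -T linker.ld -Wl,--print-memory-usage main.o -o firmware.elf
    # ld then prints a summary roughly like:
    # Memory region         Used Size  Region Size  %age Used
    #            FLASH:       12345 B       256 KB      4.71%
    #              RAM:        2345 B        64 KB      3.58%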
gcc (or g++ for that matter) has no idea about memory usage, and the linker can only report usage of static storage memory, which is usually:
.text: The "program" or code to be executed. This might be located in RAM or in ROM (e.g. Flash) depending on options and architecture.
.rodata: Data in static storage that is read-only and does not need initialization at run-time. This is usually located in non-volatile memory like ROM or Flash; but there are exceptions, one of which is avr-gcc.
.data, .bss and COMMON: Data in RAM that's initialized during start-up by the CRT (C Runtime).
Apart from these common sections, there might be other sections like .init*, .boot, .jumptables etc., which again depend on the application and architecture.
By its very nature, the linker (or assembler or compiler) cannot determine memory usage that unfolds at run-time, which is:
Stack usage: Non-static local variables that cannot be held in registers, alloca, ...
Heap usage: malloc, new and friends.
What the compiler can do for you is -fstack-usage and similar, which generates a text file *.su for each translation unit. The compiler reports stack usage that's known at compile time (static) and flags stack usage that only arises at run-time (dynamic). Functions marked static use the specified amount of stack space, not counting the stack used by non-inlined callees.
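A hypothetical illustration (the file and function names are invented; each .su line has the form file:line:column:function, the byte count, and a qualifier):

    gcc -c -fstack-usage parser.c      # writes parser.su next to parser.o
    cat parser.su
    # parser.c:12:6:reset_state       16     static
    # parser.c:40:6:parse_line        208    dynamic,bounded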
In order to know the complete stack usage (or a reliable upper bound), the dynamic call graph must be known. Even if it's known, GCC won't do the analysis for you. You will need other, more elaborate tools to work out these metrics, e.g. by abstract interpretation or other means of static analysis.
Notice that data collected at run-time, like measured stack usage, only provides a lower bound on memory usage (or execution time, for that matter). However, for sound analysis, as in safety-critical applications, what you need are upper bounds.

compiler options to increase optimization performance of the code

I am porting code from an Intel architecture to an ARM architecture, and I am trying to build the same code with ARM cross compilers on CentOS. Are there any compiler options that increase the performance of the executable image, since an image that performs well on Intel might not perform similarly on ARM? Is there a way to achieve this?
A lot of optimization options exist in GCC. By default the compiler tries to keep compilation as short as possible and to produce object code that is easy to debug. However, GCC also provides several options for optimization.
Four general levels of incremental optimization for performance are available in GCC, activated by passing one of the options -O0, -O1, -O2 or -O3. Each of these levels activates a set of optimizations that can also be activated manually by specifying the corresponding command-line option. For instance, with -O1 the compiler performs branches using the decrement-and-branch instruction (if available and applicable) rather than decrementing the register, comparing it to zero and branching in separate instructions; this can also be requested manually with the -fbranch-count-reg option. Note that the optimizations performed at each level also depend on the target architecture; you can get the list of available and enabled optimizations by running GCC with the -Q --help=optimizers option.
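For instance, to see which optimizations a level actually enables for your compiler and target (the exact list varies with GCC version and architecture), something like this works; the diff variant assumes a bash shell:

    gcc -O2 -Q --help=optimizers | less
    diff <(gcc -O1 -Q --help=optimizers) <(gcc -O2 -Q --help=optimizers)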
Generally speaking, the levels of optimization correspond to the following (note that each level also applies the optimizations of the previous ones):
-O0: the default level; the compiler tries to reduce compilation time and produce object code that can be easily processed by a debugger
-O1: the compiler tries to reduce both code size and execution time. Only optimizations that do not take a lot of compile time are performed. Compilation may take considerably more memory.
-O2: the compiler applies almost all available optimizations that do not involve a space-speed tradeoff. Compilation takes longer, but performance should improve.
-O3: the compiler also applies optimizations that might increase code size (for instance, function inlining)
For a detailed description of all optimization options, you can have a look at the Optimize Options section of the GCC manual.
As a general remark, consider that compiler optimizations are designed to work in the general case, but their effectiveness depends a lot on both your program and the architecture you run it on.
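As a concrete starting point for an ARM cross build, something along these lines is typical; the toolchain prefix and CPU name are placeholders, so substitute the ones matching your toolchain and board:

    arm-linux-gnueabihf-gcc -O2 -mcpu=cortex-a72 -o app app.c
    # or, if larger code size is acceptable in exchange for more aggressive optimization:
    arm-linux-gnueabihf-gcc -O3 -mcpu=cortex-a72 -o app app.c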
Edit:
If you are interested in memory paging optimization, there is the -freorder-blocks-and-partition option (also activated at -O2). This option reorders the basic blocks inside each function to partition them into hot blocks (executed frequently) and cold blocks (executed rarely); hot blocks are then placed in contiguous memory locations. This should increase cache locality and paging performance.

How is -march different from -mtune?

I tried to scrub the GCC man page for this, but still don't get it, really.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Is it ever a good idea to use just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
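A quick way to check what -march=native resolves to on a given machine (useful for spotting the unrecognized-CPU case described above):

    gcc -march=native -Q --help=target | grep -E -- '-march=|-mtune='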
This is what I've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all the features of X. The GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X and has a good chance of running on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, I guess?) may be specific to a particular CPU vendor; making use of these will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to -march=core2 and -mtune=i686 - your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize code for much newer haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is there to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you are only ever going to run on the same machine you build on.
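Putting that advice into commands (the CPU names are only examples; pick the oldest CPU you actually need to support):

    gcc -O2 -march=nehalem -mtune=generic -o app app.c   # runs on Nehalem and newer, tuned for current CPUs
    gcc -O2 -march=native -o app app.c                   # only if the binary stays on the build machine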

Why does GCC drop the frame pointer on 64-bit?

What's the rationale behind dropping the frame pointer on 64-bit architectures by default? I'm well aware that it can be enabled, but why does GCC disable it in the first place while keeping it enabled for 32-bit? After all, 64-bit CPUs have more registers than 32-bit CPUs.
Edit:
Looks like the frame pointer will also be dropped for x86 when using a more recent GCC version. From the manual:
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.
But why?
For x86-64, the ABI (PDF) encourages the absence of a frame pointer. The rationale is more or less "we have DWARF now, so it's not necessary for debugging or exception unwinding; if we make it optional from day one, then no software will come to depend on its existence."
x86-64 does have more registers than x86-32, but it still doesn't have enough. Freeing up more general-purpose registers is always a Good Thing from a compiler's point of view. The operations that require a stack crawl are slower, yes, but they are rare events, so it's a good tradeoff: a few cycles shaved off every subroutine call, plus fewer stack spills.
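If you want to see the effect yourself, or re-enable the frame pointer (e.g. for simpler profiling or stack traces), a small experiment like this shows the difference; the exact assembly depends on GCC version and target:

    echo 'int f(int x){ return x + 1; }' > f.c
    gcc -O2 -S -o default.s f.c                         # frame pointer omitted by default
    gcc -O2 -fno-omit-frame-pointer -S -o withfp.s f.c  # keeps the push/mov frame-pointer prologue
    diff default.s withfp.s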

When should I use GCC's -pipe option?

The GCC 4.1.2 documentation has this to say about the -pipe option:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation. This fails to work on some systems where the assembler is unable to read from a pipe; but the GNU assembler has no trouble.
I assume I'd be able to tell from the error message if my systems' assemblers didn't support pipes, so besides that issue, when does it matter whether I use this option? What factors should go into deciding to use it?
In our experience with a medium-sized project, adding -pipe made no discernible difference in build times. We ran into a couple of problems with it (sometimes failing to delete intermediate files if an error was encountered, IIRC), and so since it wasn't gaining us anything, we quit using it rather than trying to troubleshoot those problems.
It doesn't usually make any difference
It has pros and cons. Historically, running the compiler and assembler simultaneously would stress RAM resources.
GCC is small by today's standards, and -pipe adds a bit of multi-core-friendly parallel execution.
But by the same token the CPU is so fast that it can create that temporary file and read it back without you even noticing. And since -pipe was never the default mode, it occasionally acts up a little. A single developer will generally report not noticing the time difference.
Now, there are some large projects out there. You can check out a single tree that will build all of Firefox, or NetBSD, or something like that, something that is really big. Something that includes all of X, say, as a minor subsystem component. You may or may not notice a difference when the job involves millions of lines of code in thousands and thousands of C files. As I'm sure you know, people normally work on only a small part of something like this at one time. But if you are a release engineer or working on a build server, or changing something in stdio.h, you may well want to build the whole system to see if you broke anything. And now, every drop of performance probably counts...
Trying this out now, it looks to be moderately faster to build when the source and build directories are on NFS (Linux network). Memory usage is higher, though. If you never fill the RAM and have source on NFS, -pipe seems like a win.
Honestly, there is very little reason not to use it. -pipe will only use a bit more RAM, which, if this box is building code, I'd assume it has a decent amount of. It can significantly improve build time if your system uses a more conservative filesystem that writes out and then deletes all the temporary files along the way (ext3, for example).
One advantage is that with -pipe the compiler interacts less with the file system. Even when it is a RAM disk, the data still has to go through the block I/O and file system layers when using temporary files, whereas with a pipe the path is a bit more direct.
With files, the compiler first needs to finish writing before it can invoke the assembler. Another advantage of pipes is that the compiler and the assembler can run at the same time, which makes slightly better use of SMP architectures. In particular, when the compiler has to wait for data from the source file (because of blocking I/O calls), the operating system can give the assembler full CPU time and let it do its job faster.
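If you want to know whether -pipe buys you anything on your own project, the cheapest way to settle it is to measure; a rough sketch, assuming a Make-based build on Linux that honors CFLAGS:

    make clean && time make -j"$(nproc)" CFLAGS="-O2 -pipe"
    make clean && time make -j"$(nproc)" CFLAGS="-O2"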
From a hardware point of view, I guess you would use -pipe to extend the lifetime of your hard drive.
