Speed up GNU make build process - Parallelism? - gcc

I build a huge project frequently, and it takes a long time (more than one hour) to finish even after configuring precompiled headers. Are there any guidelines or tricks to let make work in parallel (e.g. starting gcc in the background, etc.) for faster builds?
Note: The sources and binaries are too large to fit in a RAM file system, and I don't want to change the directory structure or build philosophy.

You can try
make -j<number of jobs to run in parallel>

make -jN is a must now that most machines are multi-core. If you don't want to write -jN each time, you can put
export MAKEFLAGS=-jN
in your .bashrc.
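For instance, to size N to the machine automatically (nproc is part of GNU coreutils; on systems without it, substitute your core count by hand):
export MAKEFLAGS=-j$(nproc)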
You may also want to check out distcc.
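A minimal distcc setup might look like the following, assuming your Makefile respects $(CC); the host names are placeholders for your build machines:
export DISTCC_HOSTS='localhost buildbox1 buildbox2'
make -j12 CC=distcc
Size -j to roughly the total number of cores across all the hosts.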

If your project is becoming too big for one machine to handle, you can use one of the distributed make replacements, such as Electric Cloud.

If you want to run your build in parallel,
make -jN
does the job, but keep in mind:
N should normally equal the number of hardware threads your machine supports; GNU make will not cap it for you, and going much beyond that just makes jobs compete for CPU time.
make doesn't support parallel builds with -jN on MS-DOS; it forces N=1 and does a serial build.
Read more here, from the make source: http://cmdlinelinux.blogspot.com/2014/04/parallel-build-using-gnu-make-j.html

Related

OCaml parallelize builds

Let's say I want to build a program that compiles OCaml source code in parallel. I want to understand how to accomplish this with OCaml today: given the current state of the language, how do I parallelize parts of my program?
Should I just spawn new processes with the Unix module?
In the case of parallel compilation, does this have any overhead/performance impact?
To compile the files you have to run the OCaml compiler, once per file. So yes, you have to start new processes, and the Unix module has the necessary functionality for that.
As for overhead or performance impacts consider this:
1) You need to start one process per file you compile. Whether you do that sequentially or in parallel, the number of processes started remains the same.
2) Compiling a file takes a long time compared to the bookkeeping needed to start each compile, even when compiles run in parallel.
So are you worried about the overhead of starting and tracking multiple compiler processes in parallel? Don't be; that's less than 0.1% of your time. On the other hand, utilizing 2 CPU cores by running 2 compilers will basically double your speed.
Or are you worried about multiple compilers running in parallel? You need twice the RAM, and on most modern CPUs caches are shared, so cache performance will suffer to some extent. But unless you are working on some embedded system that is absolutely memory-starved, that won't be a problem. So again, the benefits of using multiple cores far outweigh the drawbacks.
Just follow the simple rule for running things in parallel: run one job per CPU core, with maybe one extra to smooth out I/O waits. More jobs just fight over CPU time without any benefit.
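A rough sketch of that pattern using only the Unix module (the compiler command and file names are illustrative; build with ocamlfind ocamlopt -package unix -linkpkg):

(* Compile each file in a child ocamlc process, keeping at most
   [jobs] compiles in flight. Exit statuses are ignored for brevity. *)
let compile_all ?(jobs = 4) files =
  let running = ref 0 in
  let pending = ref files in
  let spawn file =
    incr running;
    ignore (Unix.create_process "ocamlc" [| "ocamlc"; "-c"; file |]
              Unix.stdin Unix.stdout Unix.stderr)
  in
  let rec loop () =
    (* fill the free job slots ... *)
    while !running < jobs && !pending <> [] do
      (match !pending with
       | f :: rest -> pending := rest; spawn f
       | [] -> ())
    done;
    (* ... then block until any child exits, and go again *)
    if !running > 0 then begin
      ignore (Unix.wait ());
      decr running;
      loop ()
    end
  in
  loop ()

let () = compile_all ~jobs:4 ["a.ml"; "b.ml"; "c.ml"]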

Faster way to compile Factor Programs

I really love the Factor language. But I find that compiling programs written in it is incredibly slow, and thus it's not feasible to create real projects with Factor.
For example, compiling the sample Calculator WebApp takes about 5 minutes on my laptop (i3 processor, 2GB RAM, running Fedora 15).
I've searched around but couldn't find a faster way to compile Factor programs than using the interpreter (the main factor binary executable).
It gets ridiculous when you only use the interpreter for each run instead of "deploying" your program to a native binary (which doesn't even work for many programs).
It means that every time I want to run the Calculator, for example, I have to sit through a 5-minute cold start.
I'd like to know whether this is a common issue, and whether there's a good way to tackle it.
I admit that before today, I had never heard of Factor. I took the opportunity to play with it. It looks nice (it reminds me of squeak-vm and Lisp at the same time). I'll cut the smalltalk (pun very much intended) and jump to your question.
Analysis
It appears that the way Factor works makes loading vocabularies slow.
I compiled Factor on my 64-bit quad-core Linux system (from git revision 60b1115452, Thu Oct 6). With everything on tmpfs, the build dir takes 641 MB, of which 2x114 MB is the factor.image and its backup (factor.image.fresh).
When strace-ing the calculator app loading, a huge list of Factor files gets loaded:
3175 Factor files are touched.
Compiling these takes roughly 30 seconds on my box.
Memory usage maxes out at just under 500 MB (virtual) with a 300 MB resident set.
I strongly suspect your box is low on memory and is getting very swappy. That would definitely explain compilation taking 5 minutes.
Can you confirm whether this is the case? (It is likely if you are running on some kind of shared host or VPS appliance.) If you run a virtual machine, consider increasing the available system memory.
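An easy way to check on Linux is to watch swap traffic while the vocabularies load, for example:
free -m
vmstat 5
Sustained nonzero si/so columns in the vmstat output mean the box is actively swapping.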
Saving Heap Images (snapshots)
I already mentioned the factor.image file (114 MB) above. It contains a 'precompiled' (bootstrapped, actually) heap image for the Factor VM. Every operation in the VM (working in the UI listener or compiling Factor files) affects the heap image.
To avoid having to recompile your source files time and time again, you can save the end result into a custom heap image:
http://docs.factorcode.org/content/article-images.html
Images
To start Factor with a custom image, use the -i=image command line switch; see Command line switches for the VM.
One reason to save a custom image is if you find yourself loading the same libraries in every Factor session; some libraries take a little while to compile, so saving an image with those libraries loaded can save you a lot of time.
For example, to save an image with the web framework loaded,
USE: furnace
save
New images can be created from scratch: Bootstrapping new images
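For instance (save-image comes from the memory vocabulary in the builds I looked at, and the image file name here is made up):
USING: memory furnace ;
"web.image" save-image
Then start Factor from that image with the switch quoted above:
./factor -i=web.image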
Deploying applications
Saving heap images results in files that will (typically) be bigger than the original bootstrap image.
The Application deployment tool creates stripped-down images containing just enough code to run a single application.
The stand-alone application deployment tool, implemented in the tools.deploy vocabulary, compiles a vocabulary down to a native executable which runs the vocabulary's MAIN: hook. Deployed executables do not depend on Factor being installed, and do not expose any source code, and thus are suitable for delivering commercial end-user applications.
Most of the time, the words in the tools.deploy vocabulary should not be used directly; instead, use Application deployment UI tool.
You must explicitly specify major subsystems which are required, as well as the level of reflection support needed. This is done by modifying the deployment configuration prior to deployment.
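If you do end up driving it from the listener, it boils down to something like this (the vocabulary name is my guess at the calculator demo's name and may differ in your checkout):
USING: tools.deploy ;
"webapps.calculator" deploy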
Concluding
I expect you'll benefit from (in order of quickest win):
increasing available RAM (only quick in virtual environments...)
saving a heap image with
USE: db.sqlite
USE: furnace.alloy
USE: namespaces
USE: http.server
save
This step brought the compilation on my system down from ~30s to 0.835s
deploying the calculator webapp to a stripped heap image (refer to the source for deployment hints)
In short: thanks for bringing Factor to my attention, and I hope my findings are of some help. Cheers!

Is there a parallel make system that is smart enough to intelligently respond to low-memory/swapping conditions?

I'm a big fan of speeding up my builds using "make -j8" (replacing 8 with whatever my current computer's number of cores is, of course), and compiling N files in parallel is usually very effective at reducing compile times... unless some of the compilation processes are sufficiently memory-intensive that the computer runs out of RAM, in which case all the various compile processes start swapping each other out, and everything slows to a crawl -- thus defeating the purpose of doing a parallel compile in the first place.
Now, the obvious solution to this problem is "buy more RAM" -- but since I'm too cheap to do that, it occurs to me that it ought to be possible to have an implementation of 'make' (or equivalent) that watches the system's available RAM, and when RAM gets down to near zero and the system starts swapping, make would automatically step in and send a SIGSTOP to one or more of the compile processes it had spawned. That would allow the stopped processes to get fully swapped out, so that the other processes could finish their compile without further swapping; then, when the other processes exit and more RAM becomes available, the 'make' process would send a SIGCONT to the paused processes, allowing them to resume their own processing. That way most swapping would be avoided, and I could safely compile on all cores.
Is anyone aware of a program that implements this logic? Or conversely, is there some good reason why such a program wouldn't/couldn't work?
For GNU Make, there's the -l option:
-l [load], --load-average[=load]
     Specifies that no new jobs (commands) should be started if there are other jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
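For example, to keep make from starting new jobs while the machine is already saturated (assuming GNU make and coreutils' nproc):
make -j$(nproc) -l$(nproc)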
I don't think there's a standard option for this, though.
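That said, the SIGSTOP/SIGCONT idea from the question can be prototyped in a few lines of shell run alongside make -j. A toy Linux-only sketch (the 256 MB threshold and the cc1 process name are illustrative, and MemAvailable needs a reasonably recent kernel):
#!/bin/sh
# Pause gcc's compiler proper when available memory runs low; resume it later.
while sleep 5; do
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    if [ "$avail_kb" -lt 262144 ]; then
        pkill -STOP cc1    # stopped compiles can be swapped out quietly
    else
        pkill -CONT cc1    # resume once memory frees up again
    fi
done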

MPI on a single dual-core machine

What happens if I run an MPI program which requires 3 nodes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try this with any command-line executable you have sitting around: run something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case -- it just makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see if a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
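For Open MPI specifically, the degraded/yield mode described in that FAQ can, if I remember the parameter name correctly, be forced with an MCA flag:
mpirun --mca mpi_yield_when_idle 1 -np 3 ./Program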
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case -- though I think even there it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs or even versions of MPIs, do a quick search to see if there are any parameters that need to be tweaked for this case.

When should I use GCC's -pipe option?

The GCC 4.1.2 documentation has this to say about the -pipe option:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation. This fails to work on some systems where the assembler is unable to read from a pipe; but the GNU assembler has no trouble.
I assume I'd be able to tell from the error message if my system's assembler didn't support pipes, so besides that issue, when does it matter whether I use that option? What factors should go into deciding to use it?
In our experience with a medium-sized project, adding -pipe made no discernible difference in build times. We ran into a couple of problems with it (sometimes failing to delete intermediate files if an error was encountered, IIRC), and so since it wasn't gaining us anything, we quit using it rather than trying to troubleshoot those problems.
It doesn't usually make any difference
It has pros and cons. Historically, running the compiler and assembler simultaneously would stress RAM resources.
GCC is small by today's standards, and -pipe adds a bit of parallel execution that a multi-core machine can exploit.
But by the same token the CPU is so fast that it can create that temporary file and read it back without you even noticing. And since -pipe was never the default mode, it occasionally acts up a little. A single developer will generally not notice any time difference.
Now, there are some large projects out there. You can check out a single tree that will build all of Firefox, or NetBSD, or something like that, something that is really big. Something that includes all of X, say, as a minor subsystem component. You may or may not notice a difference when the job involves millions of lines of code in thousands and thousands of C files. As I'm sure you know, people normally work on only a small part of something like this at one time. But if you are a release engineer or working on a build server, or changing something in stdio.h, you may well want to build the whole system to see if you broke anything. And now, every drop of performance probably counts...
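Measuring it on such a tree is cheap, since it's just a flag; assuming a build that honors CFLAGS, comparing something like
time make -j8 CFLAGS='-O2 -pipe'
against the same run without -pipe will tell you whether it matters for your setup.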
Trying this out now, it looks to be moderately faster to build when the source / build destinations are on NFS (Linux network). Memory usage is high, though. If you never fill the RAM and have source on NFS, -pipe seems like a win.
Honestly, there is very little reason not to use it. -pipe will only use a tad more RAM, and if this box is building code, I'd assume it has a decent amount. It can significantly improve build time if your system uses a more conservative filesystem that writes and then deletes all the temporary files along the way (ext3, for example).
One advantage is that with -pipe the compiler interacts less with the file system. Even on a RAM disk, the data still needs to go through the block I/O and file system layers when using temporary files, whereas with a pipe it is a bit more direct.
With files, the compiler first needs to finish writing before it can call the assembler. Another advantage of pipes is that the compiler and the assembler can run at the same time, making a bit more use of SMP architectures. In particular, when the compiler has to wait for data from the source file (because of blocking I/O calls), the operating system can give the assembler full CPU time and let it do its job faster.
From a hardware point of view I guess you would use -pipe to preserve the lifetime of your hard drive.
