Speed up embedded Linux compilation process

I have an embedded Linux (OpenWrt) project for custom hardware. Any change to the kernel or an application requires recompiling the full image or the application, and recompiling is painfully slow.
To reduce this pain I bought an AMD Threadripper 3970X based workstation with 128 GB of RAM and a 1 TB SSD. Benchmarks for this CPU show a Linux kernel compilation time of about 120 seconds.
But my compilation times are longer than that.
Full image compilation first time reduced from:
to:
Repeated image compilation reduced from:
to:
Package recompilation ($ time make package/tensorflow/compile) reduced from:
to:
I.e. compile time was reduced 2-7x.
During the first image compilation all the necessary source code has to be downloaded from the network. I have a Fast Ethernet (100 Mb/s) connection, so little time is wasted on that.
I use a RAM disk:
$ sudo mkdir /mnt/ramdisk
$ sudo mount -t tmpfs -o rw,size=64G tmpfs /mnt/ramdisk
to store all sources, object files and temporary files, so I believe there are no I/O losses.
I compile with make -j64, but I very rarely see all 64 logical cores loaded during compilation:
Mostly I see the following:
or even this:
so I can't believe that faster compilation can't be achieved. Could someone give me hints or advice on how to speed up the GCC C/C++ cross-compilation process? Some searching points me to distcc and Parallel GCC, but I don't have experience with them, so I'm not sure whether they are what I need. OpenWrt has almost nothing in its manuals explaining how to speed up the build process.

Linux builds are incremental: the first build takes time, but after that only the parts that were changed or added need to be rebuilt. There is no need to rebuild everything, so subsequent builds are faster.
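As a rough illustration of the incremental build idea, here is a minimal Python sketch of the timestamp rule that make-style tools apply: a target is rebuilt only if it is missing or older than one of its sources. The function name needs_rebuild and the file names are hypothetical, and real build systems (including the OpenWrt one) track more than modification times.

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any file in `sources`."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any source modified after the target was produced makes the target stale.
    return any(os.path.getmtime(src) > target_mtime for src in sources)


# Hypothetical usage: only recompile main.o when main.c or config.h changed.
# if needs_rebuild("main.o", ["main.c", "config.h"]):
#     ...run the compiler...
```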
Not all CPU cores will be loaded all the time; it depends on how many tasks are runnable at that moment. Suppose your system has 8 cores but only 6 tasks are running: in that case the cores will not all be fully loaded.
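The 8 cores / 6 tasks example can be put into numbers with a small sketch. The helper names and the phase list below are made up for illustration; they are not measurements from the build in the question.

```python
def core_utilization(runnable_tasks, cores):
    """Fraction of cores kept busy when `runnable_tasks` jobs are ready to run."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return min(runnable_tasks, cores) / cores


def average_utilization(phases, cores):
    """Time-weighted utilization over (duration_seconds, runnable_tasks) phases."""
    total_time = sum(duration for duration, _ in phases)
    busy_time = sum(duration * core_utilization(tasks, cores)
                    for duration, tasks in phases)
    return busy_time / total_time


print(core_utilization(6, 8))  # 0.75: two of the eight cores stay idle

# Hypothetical build: a serial configure step, a wide compile step, a narrow link step.
hypothetical_phases = [(60, 1), (120, 64), (30, 4)]
print(average_utilization(hypothetical_phases, 64))
```

With make -j64, the serial or narrow phases (configure scripts, linking, packaging) are what keep most cores idle, no matter how high the -j value is.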

Related

How to check IRQ latency in Linux (X86_64) for performance tuning?

Is there a way to check the interrupt processing latency in Linux kernel?
Or is there a way to check why CPU usage is only 40% in a specific configuration of Linux 4.19.138?
Background:
I have run into a problem. I have an x86 server running either a third-party Linux 4.19.138 kernel (whose configuration file is about 6000 lines long) or Ubuntu 20.04 x86_64 (whose configuration file is about 9500 lines long).
When running a netperf test on this server, I found that with the third-party Linux 4.19.138 kernel the netperf I/O latency is worse than with Ubuntu 20.04. CPU usage is below 40% when running the third-party kernel, while it is about 100% when running Ubuntu 20.04.
Both use the same kernel command line and the same performance profile at runtime.
It seems that interrupt handling or the netserver process on the server is throttled under Linux 4.19.138.
I then rebuilt the Ubuntu 20.04 kernel using the short configuration file (6000 lines long) and got similarly bad results.
So I concluded that the kernel configuration makes the difference.
Before comparing the two configurations (6000 lines vs. 9500 lines), and to narrow things down, my question is: is there a way to check why CPU usage is only 40% with that configuration of 4.19.138? Or is there a way to check the interrupt processing latency in the Linux kernel?
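For the comparison of the two configuration files itself, a minimal sketch like the following may help. It assumes the usual .config syntax (CONFIG_X=value lines and "# CONFIG_X is not set" comments); the function names and sample options are hypothetical. The kernel source tree also ships a scripts/diffconfig helper for the same purpose.

```python
def parse_kconfig(text):
    """Map CONFIG_* option names to their values from .config-style text."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" means the option is explicitly disabled.
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(a, b):
    """Return {option: (value_in_a, value_in_b)} for options whose values differ.

    An option that is absent from one file shows up as None on that side.
    """
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Hypothetical excerpts standing in for the 6000-line and 9500-line files.
short_cfg = parse_kconfig("CONFIG_SMP=y\n# CONFIG_NET_RX_BUSY_POLL is not set\n")
full_cfg = parse_kconfig("CONFIG_SMP=y\nCONFIG_NET_RX_BUSY_POLL=y\n")
for name, (short_val, full_val) in diff_kconfig(short_cfg, full_cfg).items():
    print(name, short_val, full_val)
```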
I finally found the reason: net.core.busy_read and net.core.busy_poll are both set to 0.
That means socket busy polling is disabled, which hurts the netperf latency.
But the question now becomes:
In this case, the lower CPU usage is a sign that something differs in the kernel. What kind of tool or method can we use to figure out what causes the CPU usage difference between the two kernels?
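To compare the two busy-polling settings named above on each kernel, they can be read from /proc/sys. This is a small sketch, assuming the standard mapping of sysctl names to paths under /proc/sys; the helper names are hypothetical, and None is returned when a file does not exist (for example when the kernel was built without busy-poll support).

```python
import os


def read_sysctl(name, proc_sys="/proc/sys"):
    """Read a sysctl such as 'net.core.busy_poll'; return None if unavailable."""
    path = os.path.join(proc_sys, *name.split("."))
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def busy_polling_state(proc_sys="/proc/sys"):
    """Collect the two busy-polling sysctls discussed above."""
    names = ("net.core.busy_read", "net.core.busy_poll")
    return {name: read_sysctl(name, proc_sys) for name in names}


print(busy_polling_state())
```

Running the same script under both kernels and diffing the output is a cheap first check before digging into the configuration files.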

How SCons cache works with different OS and CPU architectures?

Is SCons cache safe for different operating systems and CPU architectures?
Across different operating systems, sure, but on the same operating system across different CPU architectures, no, not by default. The last time I used the SCons cache (SCons v2.0.1), it was not safe across different CPU architectures. That was the reason we stopped using it at my current job. It can be made safe by inserting the architecture into the build environment correctly, but it is difficult to get that to work right.
Unless every build machine on your network has exactly the same hardware specs, I don't recommend using the SCons cache; try getting clever with variant directories instead. That can at least save you from having to rebuild everything when changing build modes.
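The idea of putting the architecture into whatever identifies a cached object can be sketched as follows. This is not SCons's actual signature algorithm, just a hypothetical cache_key function showing why objects built on different architectures need different keys.

```python
import hashlib
import platform


def cache_key(source_bytes, compiler_flags, arch=None, system=None):
    """Hash the source, the flags, the OS and the CPU architecture into one key."""
    arch = arch or platform.machine()
    system = system or platform.system()
    digest = hashlib.sha256()
    parts = (system.encode(), arch.encode(),
             " ".join(compiler_flags).encode(), source_bytes)
    for part in parts:
        # Length-prefix each part so adjacent fields cannot run together.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


source = b"int main(void) { return 0; }\n"
key_x86 = cache_key(source, ["-O2"], arch="x86_64", system="Linux")
key_arm = cache_key(source, ["-O2"], arch="aarch64", system="Linux")
print(key_x86 != key_arm)  # same source and flags, but separate cache entries
```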

Linux Kernel Compilation speed up command

I am using Linux 3.18.25 on a second-generation i5 machine (dual-booting Windows and Linux). I am making some changes to kernel modules to get an idea of the kernel code. The problem is that every time I compile my code with the make command it takes approximately 1 hour and 30 minutes, and even with make -j 4 it takes almost the same time. What should I do to compile the kernel code more quickly? Is there any way to compile the kernel other than make or make -j 4?
Well, I am not an expert, but from my experience:
- Set the -j parameter to the number of processors you have (if you have 8, make it 8); you can check with 'cat /proc/cpuinfo'.
- If it's a virtual machine, make sure you have hyper enabled and the virtual machine is using more than one physical CPU core.
- Don't use a cross toolchain; try to compile on the same architecture as the target (i.e. if it's amd64, compile on a 64-bit AMD machine).
EDIT:
(Update from Andy's comment) Check ccache and how it is used in kernel compilation: http://linuxdeveloper.blogspot.de/2012/05/using-ccache-to-speed-up-kernel.html
Additional note: also make sure CPU frequency scaling is not holding your CPU back: https://askubuntu.com/questions/523640/how-i-can-disable-cpu-frequency-scaling-and-set-the-system-to-performance
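The -j and ccache suggestions above can be combined in a small helper that only assembles the make command line. This is a sketch: the function name is hypothetical, it assumes ccache and gcc are installed, and passing CC="ccache gcc" on the make command line is one common way to route compilation through ccache. Nothing is executed here.

```python
import os


def kernel_make_command(jobs=None, use_ccache=False, compiler="gcc"):
    """Build the argument list for a parallel make, optionally via ccache."""
    if jobs is None:
        # One job per logical CPU, the same count /proc/cpuinfo would show.
        jobs = os.cpu_count() or 1
    cmd = ["make", f"-j{jobs}"]
    if use_ccache:
        # Overrides the CC make variable so every compile goes through ccache.
        cmd.append(f"CC=ccache {compiler}")
    return cmd


print(kernel_make_command(use_ccache=True))

# To actually run it from a kernel source tree (the path is an assumption):
# import subprocess
# subprocess.run(kernel_make_command(use_ccache=True), cwd="/path/to/linux", check=True)
```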
It all depends on the machine you are using. For -j4 to help you need at least 4 cores; otherwise the jobs will just wait for each other (this looks exactly like what you describe). Try compiling on a multicore machine instead (I know this is not very helpful, but in my experience compiling kernels there is not much else you can do).
EDIT:
As it turns out, I have lived a very sheltered life so far. Kernel compilation usually takes between 1 and 2 hours, which is exactly what you see.
BUT:
There are still things you can do, and they are all listed here
Good luck.

ATLAS installation from source

I'm installing ATLAS on RHEL 6 with gcc 4.4.2 using
../configure -b 64 -Fa alg -fPIC --cc=/lib/gcc/64-bit/4.4.2/bin/gcc --prefix=/home/pkgs/atlas
I have an Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz machine with 8 GB of RAM, and it takes close to 5 hours just for "make build". Is this normal? Is there a way to speed up the tune and build process?
From the ATLAS installation guide:
This is the step where ATLAS performs all its empirical tuning, and then uses the discovered kernels to build all required libraries. It uses the BLDdir created by the configure step, and is invoked from the BLDdir with the make build command, or simply by make. This step can be quite long, depending on your platform and whether or not you use architectural defaults. For a system like the Core2Duo with architectural defaults, the build step may take 10 or 20 minutes, while in order to complete a full ATLAS search on a slower platform (eg. MIPS) could take anywhere between a couple of hours and a full day.
So yes, this behaviour is totally normal, because ATLAS performs extensive tests to determine the best math kernels for your system.
And yes, there is a way to speed up the build process: use the architectural defaults. Note, however, that this could result in inferior performance of your ATLAS installation.

Emulating a processor's (limited) resources, including clock speed

I would like a software environment in which I can test the speed of my software on hardware with specific resources. For example, how fast does this program run on an 800MHz x86 with 24 MB of RAM, when my host hardware is a 3GHz quad core amd64 with 12GB of RAM? Emulators such as qemu make a great point of running "almost as fast" as the underlying hardware; I would like to make it run slower. Is there a way to do that?
I have never tried it, but perhaps you could achieve what you want to some extent by combining an emulator like QEMU or VirtualBox on Linux with something like this:
http://cpulimit.sourceforge.net/
If you can limit the CPU time available to the emulator you might be able to simulate the results of execution on a slower computer. Keep in mind, though, that this would only affect the execution speed (or so I hope, anyway).
The CPU instruction set and other system features would remain unchanged. This means that emulating a specific processor accurately would be difficult if not impossible.
In addition, using something like cpulimit, which works by using SIGSTOP and SIGCONT to repeatedly stop and restart the emulator process, might cause side effects such as timing inconsistencies, video display artifacts etc.
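The SIGSTOP/SIGCONT approach described above can be sketched in a few lines of Python. This is not cpulimit's implementation, only an illustration of the duty-cycle idea on a POSIX system; the function names and the 0.1 second period are assumptions.

```python
import os
import signal
import time


def duty_cycle(limit_percent, period=0.1):
    """Split each period into (run_seconds, stop_seconds) for a CPU share."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in (0, 100]")
    run = period * limit_percent / 100.0
    return run, period - run


def throttle(pid, limit_percent, duration, period=0.1):
    """Let process `pid` run for roughly limit_percent of the time for `duration` seconds."""
    run, stop = duty_cycle(limit_percent, period)
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            time.sleep(run)                   # process runs freely
            if stop > 0:
                os.kill(pid, signal.SIGSTOP)  # freeze it for the rest of the period
                time.sleep(stop)
                os.kill(pid, signal.SIGCONT)  # and let it continue
    finally:
        try:
            os.kill(pid, signal.SIGCONT)      # never leave the process frozen
        except ProcessLookupError:
            pass                              # it already exited
```

Because the process is frozen in coarse slices rather than slowed down evenly, anything inside the emulator that measures wall-clock time sees the pauses, which is where the side effects mentioned above come from.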
In your emulator, keep a virtual "clock" and increment it appropriately as you execute each instruction. From there you can simply report how long it took in virtual time to execute, or you can have your emulator sleep now and again to keep execution speed roughly where it would be in the target.
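A minimal sketch of such a virtual clock is shown below. It assumes a toy emulator in which each instruction has a fixed cycle cost; the class, the run function, the cycle table and the sync_every parameter are all hypothetical, and the instruction semantics are left out.

```python
import time


class VirtualClock:
    """Counts emulated cycles and converts them to seconds on the target CPU."""

    def __init__(self, target_hz):
        self.target_hz = target_hz
        self.cycles = 0

    def tick(self, cycles):
        self.cycles += cycles

    @property
    def virtual_seconds(self):
        return self.cycles / self.target_hz


def run(program, cycle_costs, clock, pace=False, sync_every=1000,
        sleep=time.sleep, now=time.monotonic):
    """Charge each instruction's cycles to the clock; optionally pace to real time."""
    start = now()
    for count, op in enumerate(program, start=1):
        clock.tick(cycle_costs[op])  # a real emulator would also execute op here
        if pace and count % sync_every == 0:
            # Sleep now and again so wall-clock time catches up with virtual time.
            lag = clock.virtual_seconds - (now() - start)
            if lag > 0:
                sleep(lag)
    return clock.virtual_seconds


clock = VirtualClock(target_hz=800_000_000)  # the 800MHz target from the question
costs = {"add": 1, "load": 3}                # made-up cycle costs
print(run(["add", "load"] * 1000, costs, clock))
```

With pace=False the emulator runs at full host speed and simply reports the virtual time; with pace=True it also slows itself down to roughly the target's speed.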

Resources