GNU make: should the number of jobs equal the number of CPU cores in a system? - makefile

There seems to be some controversy on whether the number of jobs in GNU make is supposed to be equal to the number of cores, or if you can optimize the build time by adding one extra job that can be queued up while the others "work".
Is it better to use -j4 or -j5 on a quad core system?
Have you seen (or done) any benchmarking that supports one or the other?

I've run my home project on my 4-core with hyperthreading laptop and recorded the results. This is a fairly compiler-heavy project but it includes a unit test of 17.7 seconds at the end. The compiles are not very IO intensive; there is very much memory available and if not the rest is on a fast SSD.
1 job real 2m27.929s user 2m11.352s sys 0m11.964s
2 jobs real 1m22.901s user 2m13.800s sys 0m9.532s
3 jobs real 1m6.434s user 2m29.024s sys 0m10.532s
4 jobs real 0m59.847s user 2m50.336s sys 0m12.656s
5 jobs real 0m58.657s user 3m24.384s sys 0m14.112s
6 jobs real 0m57.100s user 3m51.776s sys 0m16.128s
7 jobs real 0m56.304s user 4m15.500s sys 0m16.992s
8 jobs real 0m53.513s user 4m38.456s sys 0m17.724s
9 jobs real 0m53.371s user 4m37.344s sys 0m17.676s
10 jobs real 0m53.350s user 4m37.384s sys 0m17.752s
11 jobs real 0m53.834s user 4m43.644s sys 0m18.568s
12 jobs real 0m52.187s user 4m32.400s sys 0m17.476s
13 jobs real 0m53.834s user 4m40.900s sys 0m17.660s
14 jobs real 0m53.901s user 4m37.076s sys 0m17.408s
15 jobs real 0m55.975s user 4m43.588s sys 0m18.504s
16 jobs real 0m53.764s user 4m40.856s sys 0m18.244s
inf jobs real 0m51.812s user 4m21.200s sys 0m16.812s
Basic results:
Scaling to the core count increases the performance nearly linearly. The real time went down from 2.5 minutes to 1.0 minute (2.5x as fast), but the time taken during compile went up from 2.11 to 2.50 minutes. The system noticed barely any additional load in this bit.
Scaling from the core count to the thread count increased the user load immensely, from 2.50 minutes to 4.38 minutes. This near doubling is most likely because the other compiler instances wanted to use the same CPU resources at the same time. The system is getting a bit more loaded with requests and task switching, causing it to go to 17.7 seconds of time used. The advantage is about 6.5 seconds on a compile time of 53.5 seconds, making for a 12% speedup.
Scaling from thread count to double thread count gave no significant speedup. The times at 12 and 15 are most likely statistical anomalies that you can disregard. The total time taken increases ever so slightly, as does the system time. Both are most likely due to increased task switching. There is no benefit to this.
My guess right now: If you do something else on your computer, use the core count. If you do not, use the thread count. Exceeding it shows no benefit. At some point they will become memory limited and collapse due to that, making the compiling much slower. The "inf" line was added at a much later date, giving me the suspicion that there was some thermal throttling for the 8+ jobs. This does show that for this project size there's no memory or throughput limit in effect. It's a small project though, given 8GB of memory to compile in.

I would say the best thing to do is benchmark it yourself on your particular environment and workload. Seems like there are too many variables (size/number of source files, available memory, disk caching, whether your source directory & system headers are located on different disks, etc.) for a one-size-fits-all answer.
My personal experience (on a 2-core MacBook Pro) is that -j2 is significantly faster than -j1, but beyond that (-j3, -j4 etc.) there's no measurable speedup. So for my environment "jobs == number of cores" seems to be a good answer. (YMMV)

I, personally, use make -j n where n is "number of cores" + 1.
I can't, however, give a scientific explanation: I've seen a lot of people using the same settings and they gave me pretty good results so far.
Anyway, you have to be careful because some make-chains simply aren't compatible with the --jobs option, and can lead to unexpected results. If you're experiencing strange dependency errors, just try to make without --jobs.

Both are not wrong. To be at peace with yourself and with author of software you're compiling (different multi-thread/single-thread restrictions apply at software level itself), I suggest you use:
make -j`nproc`
Notes: nproc is linux command that will return number of cores/threads(modern CPU) available on system. Placing it under ticks ` like above will pass the number to the make command.
Additional info: As someone mentioned, using all cores/threads to compile software can literally choke your box to near death (being unresponsive) and might even take longer than using less cores. As I seen one Slackware user here posted he had dual core CPU but still provided testing up to j 8, which stopped being different at j 2 (only 2 hardware cores that CPU can utilize). So, to avoid unresponsive box i suggest you run it like this:
make -j`nproc --ignore=2`
This will pass the output of nproc to make and subtract 2 cores from its result.

Ultimately, you'll have to do some benchmarks to determine the best number to use for your build, but remember that the CPU isn't the only resource that matters!
If you've got a build that relies heavily on the disk, for example, then spawning lots of jobs on a multicore system might actually be slower, as the disk will have to do extra work moving the disk head back and forth to serve all the different jobs (depending on lots of factors, like how well the OS handles the disk-cache, native command queuing support by the disk, etc.).
And then you've got "real" cores versus hyper-threading. You may or may not benefit from spawning jobs for each hyper-thread. Again, you'll have to benchmark to find out.
I can't say I've specifically tried #cores + 1, but on our systems (Intel i7 940, 4 hyperthreaded cores, lots of RAM, and VelociRaptor drives) and our build (large-scale C++ build that's alternately CPU and I/O bound) there is very little difference between -j4 and -j8. (It's maybe 15% better... but nowhere near twice as good.)
If I'm going away for lunch, I'll use -j8, but if I want to use my system for anything else while it's building, I'll use a lower number. :)

I just got an Athlon II X2 Regor proc with a Foxconn M/B and 4GB of G-Skill memory.
I put my 'cat /proc/cpuinfo' and 'free' at the end of this so others can see my specs. It's a dual core Athlon II x2 with 4GB of RAM.
uname -a on default slackware 14.0 kernel is 3.2.45.
I downloaded the next step kernel source (linux-3.2.46) to /archive4;
extracted it (tar -xjvf linux-3.2.46.tar.bz2);
cd'd into the directory (cd linux-3.2.46);
and copied the default kernel's config over (cp /usr/src/linux/.config .);
used make oldconfig to prepare the 3.2.46 kernel config;
then ran make with various incantations of -jX.
I tested the timings of each run by issuing make after the time command, e.g.,
'time make -j2'. Between each run I 'rm -rf' the linux-3.2.46 tree and reextracted it, copied the default /usr/src/linux/.config into the directory, ran make oldconfig and then did my 'make -jX' test again.
plain "make":
real 51m47.510s
user 47m52.228s
sys 3m44.985s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j2
real 27m3.194s
user 48m5.135s
sys 3m39.431s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j3
real 27m30.203s
user 48m43.821s
sys 3m42.309s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j4
real 27m32.023s
user 49m18.328s
sys 3m43.765s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j8
real 28m28.112s
user 50m34.445s
sys 3m49.877s
bob#Moses:/archive4/linux-3.2.46$
'cat /proc/cpuinfo' yields:
bob#Moses:/archive4$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.91
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.94
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
'free' yields:
bob#Moses:/archive4$ free
total used free shared buffers cached
Mem: 3991304 3834564 156740 0 519220 2515308

Just as a ref:
From Spawning Multiple Build Jobs section in LKD:
where n is the number of jobs to spawn. Usual practice is to spawn one or two jobs per processor. For example, on a dual processor machine, one might do
$ make j4

From my experience, there must be some performance benefits when adding extra jobs.
It is simply because disk I/O is one of the bottle necks besides CPU. However it is not easy to decide on the number of extra jobs as it is highly inter-connected with the number of cores and types of the disk being used.

Many years later, the majority of these answers are still correct. However, there has been a bit of a change: Using more jobs than you have physical cores now gives a genuinely significant speedup. As an addendum to Dascandy's table, here's my times for compiling a project on a AMD Ryzen 5 3600X on linux. (The Powder Toy, commit c6f653ac3cef03acfbc44e8f29f11e1b301f1ca2)
I recommend checking yourself, but I've found with input from others that using your logical core count for job count works well on Zen. Alongside that, the system does not seem to lose responsiveness. I imagine this applies to recent Intel CPUs as well. Do note I have an SSD, as well, so it may be worth it to test your CPU yourself.
scons -j1 --release --native 120.68s user 9.78s system 99% cpu 2:10.60 total
scons -j2 --release --native 122.96s user 9.59s system 197% cpu 1:07.15 total
scons -j3 --release --native 125.62s user 9.75s system 292% cpu 46.291 total
scons -j4 --release --native 128.26s user 10.41s system 385% cpu 35.971 total
scons -j5 --release --native 133.73s user 10.33s system 476% cpu 30.241 total
scons -j6 --release --native 144.10s user 11.24s system 564% cpu 27.510 total
scons -j7 --release --native 153.64s user 11.61s system 653% cpu 25.297 total
scons -j8 --release --native 161.91s user 12.04s system 742% cpu 23.440 total
scons -j9 --release --native 169.09s user 12.38s system 827% cpu 21.923 total
scons -j10 --release --native 176.63s user 12.70s system 910% cpu 20.788 total
scons -j11 --release --native 184.57s user 13.18s system 989% cpu 19.976 total
scons -j12 --release --native 192.13s user 14.33s system 1055% cpu 19.553 total
scons -j13 --release --native 193.27s user 14.01s system 1052% cpu 19.698 total
scons -j14 --release --native 193.62s user 13.85s system 1076% cpu 19.270 total
scons -j15 --release --native 195.20s user 13.53s system 1056% cpu 19.755 total
scons -j16 --release --native 195.11s user 13.81s system 1060% cpu 19.692 total
( -jinf test not included, as it is not supported by scons.)
Tests done on Ubuntu 19.10 w/ a Ryzen 5 3600X, Samsung 860 Evo SSD (SATA), and 32GB RAM
Final note: Other people with a 3600X may get better times than me. When doing this test, I had Eco mode enabled, reducing the CPU's speed a little.

YES! On my 3950x, I run -j32 and it saves hours of compile time! I can still watch youtube, browse the web, etc. during compile without any difference. The processor isn't always pegged even with a 1TB 970 PRO nvme or 1TB Auros Gen4 nvme and 64GB of 3200C14. Even when it is, I don't notice UI wise. I plan on testing with -j48 in the near future on some big upcoming projects. I expect, as you probably do, to see some impressive improvement. Those still with a quad-core might not get the same gains....
Linus himself just upgraded to a 3970x and you can bet your bottom dollar, he is at least running -j64.

Related

strange CPU binding/pining result within OpenMPI

I have tried to evaluate an OpenMPI program with Matrix Multiplication algorithm, the written code scales very well on a single thread per core machine in our Laboratory (close to ideal speedup within 48 and 64 cores), However, on some other machines which are hyperthreaded there is strange behavior, as you can see in the screenshot from htop I realized the CPU utilization when I run the same experiment with the same command is different and strange, I executed the program with
mpirun --bind-to hwthread--use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to each hwthread, and can be seen with -n 2 which means I overwrite the variable in such a way to bind the execution on two processors (here hwthreads), however, seems it uses another hwthread with more or less 50% of utilization as well! I found this strange because there is not any extra CPU utilization on other machines, I tried this experiment many times and I'm sure this is not a temporary check or sth by OS and is due to the execution model of OpenMPI.
I appreciate it if someone could explain this behavior and extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe Hyperthreading is not the case and I was misled by this, but the only big difference between these environments are 1) the Hyperthreading and 2) Clock Frequency of the processors which is based on different CPUs is different between 2200 MHz to 4.8 GHz.

Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings

I'm confused about how multiple launches of same python command bind to cores on a NUMA Xeon machine.
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyperthreaded HT machine (lscpu output below) it'd limit the this numactl process to 4 threads.
But since machine has HT, it's not clear to me if 4-7 in the above are 4 physical or 4 logical.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
Where is OpenMP library getting invoked in binding threads to physical cores?
(from my limited understanding I could just launch command python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS still applies, even though I didn't explicitly use MPI lib anywhere, is that correct?)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 9242 CPU # 2.30GHz
Stepping: 7
Frequency boost: enabled
CPU MHz: 1000.026
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4600.00
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 96 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.
numactl do not launch threads. It controls NUMA policy for processes or shared memory. However, OpenMP runtimes may adapt the number of threads created by a region based on the environment set by numactl (although AFAIK this behaviour is undefined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration using the environment variable OMP_DISPLAY_ENV.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
This is a bit complex. Physical IDs are the ones available in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually. e.g. a BIOS update or kernel update might lead to enumerating logical cores in a different order.
You can use the great tool hwloc to convert well-defined deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true here for your processor, where it looks like the kernel enumerated one logical core from each physical core as cores 0..95, then 96..191 for the other logical core on each physical core). The other common possibility is for Linux to do both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
--physcpubind of numctl accepts physical cpu numbers as shown in the "processor" fields of /proc/cpuinfo regarding the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two threads IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
Where is OpenMP library getting invoked in binding threads to physical cores?
AFAIK, this is implementation dependent of the OpenMP runtime used (eg. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while some other use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom user-defined method and not the default one.
If you want to be safe and portable, use the following configuration (using Bash):
OMP_NUM_THREADS=4
OMP_PROC_BIND=TRUE
OMP_PLACES={$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}
The OpenMP threads will be scheduled on OpenMP places. The above configuration configure the OpenMP runtime so that there will be 4 threads statically map on 4 different fixed cores.

Why does `native_write_msr` dominate my profiling result?

I have a program that runs on a multi-thread framework with Linux kernel 4.18 and Intel CPU. I ran perf record -p pid -g -e cycles:u --call-graph lbr -F 99 -- sleep 20 to collect stack trace and generate flame graph.
My program was running under a low workload, so the time spent on futex_wait is expected. But the top of the stack is a kernel function native_write_msr. According to What does native_write_msr in kernel do? and https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/msr.h#L103, this function is used for performance counters. I have disabled the tracepoint in native_write_msr.
And pidstat -p pid 1 told me that the system CPU usage is quite low.
05:44:34 PM UID PID %usr %system %guest %CPU CPU Command
05:44:35 PM 1001 67441 60.00 4.00 0.00 64.00 11 my_profram
05:44:36 PM 1001 67441 58.00 7.00 0.00 65.00 11 my_profram
05:44:37 PM 1001 67441 61.00 3.00 0.00 64.00 11 my_profram
My questions are
Why does native_write_msr appear so many times in the stack traces (as a result, it occupies a large space in the flame graph for about 80%). Is it a block operation, or it realeases the CPU when called?
Why is the system CPU usage relatively low against the frame graph? According to the graph, 80% of the CPU time should belong to %system instead of %usr.
Any help is appreciated. If I miss any useful infomation, please comment.
Thank you very much!
From the flamegraph, you could find that native_write_msr function is called by the function schedule. When a running process is removed from one core (because it's migrated to another core or stopped by the scheduler to run another process), the scheduler need to dump the process's perf data and clean its perf configurations, so we don't mess up perf data of different processes. The scheduler may need to write to msr in this step, thus calling native_write_msr. So native_write_msr are called for so many times because scheduling or core migrations happens too frequently.

Analyzing cause of performance regression with different kernel version

I have come across a strange performance regression from Linux kernel 3.11 to 3.12 on x86_64 systems.
Running Mark Stock's Radiance benchmark on Fedora 20, 3.12 is noticeably slower. Nothing else is changed - identical binary, identical glibc - I just boot a different kernel version, and the performance changes.
The timed program, rpict, is 100% CPU bound user-level code.
Before I report this as a bug, I'd like to find the cause for this behavior. I don't know a lot about the Linux kernel, and the change log from 3.11 to 3.12 does not give me any clue.
I observed this on two systems, an Intel Haswell (i7-4771) and an AMD Richland (A8-6600K).
On the Haswell system user time went from 895 sec with 3.11 to 962 with 3.12. On the Richland, from 1764 to 1844. These times are repeatable to within a few seconds.
I did some profiling with perf, and found that IPC went down in the same proportion as the slowdown. On the Haswell system, this seems to be caused by more missed branches, but why should the prediction rate go down? Radiance does use the random number generator - could "better" randomness cause the missed branches? But apart from OMAP4 support, the RNG does not have to seem changed in 3.12.
On the AMD system, perf just points to more idle backend cycles, but the cause is not clear.
Haswell system:
3.11.10 895s user, 3.74% branch-misses, 1.65 insns per cycle
3.12.6 962s user, 4.22% branch-misses, 1.52 insns per cycle
Richland system:
3.11.10 1764s user, 8.23% branch-misses, 0.75 insns per cycle
3.12.6 1844s user, 8.26% branch-misses, 0.72 insns per cycle
I also looked at a diff from the dmesg output of both kernels, but did not see anything that might have caused such a slowdown of a CPU-bound program.
I tried switching the cpufreq governor from the default ondemand to peformance but that did not have any effect.
The executable was compiled using gcc 4.7.3 but not using AVX instructions. libm still seems to use some AVX (e.g. __ieee754_pow_fma4) but these functions are only 0.3% of total execution time.
Additional info:
Diff of kernel configs
diff of the dmesg outputs on the Haswell system.
diff of /proc/pid/maps - 3.11 maps only one heap region; 3.12 lots.
perf stat output from the A8-6600K system
perf stats w/ TLB misses dTLB stats look very different!
/usr/bin/time -v output from the A8-6600K system
Any ideas (apart from bisecting the kernel changes)?
Let's check your perf stat outputs: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt
Kernel 3.11.10
1805057.522096 task-clock # 0.999 CPUs utilized
183,822 context-switches # 0.102 K/sec
109 cpu-migrations # 0.000 K/sec
40,451 page-faults # 0.022 K/sec
7,523,630,814,458 cycles # 4.168 GHz [83.31%]
628,027,409,355 stalled-cycles-frontend # 8.35% frontend cycles idle [83.34%]
2,688,621,128,444 stalled-cycles-backend # 35.74% backend cycles idle [33.35%]
5,607,337,995,118 instructions # 0.75 insns per cycle
# 0.48 stalled cycles per insn [50.01%]
825,679,208,404 branches # 457.425 M/sec [66.67%]
67,984,693,354 branch-misses # 8.23% of all branches [83.33%]
1806.804220050 seconds time elapsed
Kernel 3.12.6
1875709.455321 task-clock # 0.999 CPUs utilized
192,425 context-switches # 0.103 K/sec
133 cpu-migrations # 0.000 K/sec
40,356 page-faults # 0.022 K/sec
7,822,017,368,073 cycles # 4.170 GHz [83.31%]
634,535,174,769 stalled-cycles-frontend # 8.11% frontend cycles idle [83.34%]
2,949,638,742,734 stalled-cycles-backend # 37.71% backend cycles idle [33.35%]
5,607,926,276,713 instructions # 0.72 insns per cycle
# 0.53 stalled cycles per insn [50.01%]
825,760,510,232 branches # 440.239 M/sec [66.67%]
68,205,868,246 branch-misses # 8.26% of all branches [83.33%]
1877.263511002 seconds time elapsed
There are almost 300 Gcycles more for 3.12.6 in the "cycles" field; and only 6,5 Gcycles were stalls of frontend and 261 Gcycles were stalled in the backend. You have only 0,2 G of additional branch misses (each cost about 20 cycles - per optim.manual page 597; so 4Gcycles), so I think that your performance problems are related to memory subsystem problems (more realistict backend event, which can be influenced by kernel). Pagefaults diffs and migration count are low, and I think they will not slowdown test directly (but migrations may move program to the worse place).
You should go deeper into perf counters to find the exact type of problem (it will be easier if you have shorter runs of test). The Intel's manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf will help you. Check page 587 (B.3.2) for overall events hierarchy (FE and BE stalls are here too), B.3.2.1-B.3.2.3 for info on backend stalls and how to start digging (checks for cache events, etc) and below.
How can kernel influence the memory subsystem? It can setup different virtual-to-physical mapping (hardly the your case), or it can move process farther from data. You have not-NUMA machine, but Haswell is not the exact UMA - there is a ring bus and some cores are closer to memory controller or to some parts of shared LLC (last level cache). You can test you program with taskset utility, bounding it to some core - kernel will not move it to other core.
UPDATE: After checking your new perf stats from A8 we see that there are more DLTB-misses for 3.12.6. With changes in /proc/pid/maps (lot of short [heap] sections instead of single [heap], still no exact info why), I think that there can be differences in transparent hugepage (THP; with 2M hugepages there are less TLB entries needed for the same amount of memory and less tlb misses), for example in 3.12 it can't be applied due to short heap sections.
You can check your /proc/PID/smaps for AnonHugePages and /proc/vmstat for thp* values to see thp results. Values are documented here kernel.org/doc/Documentation/vm/transhuge.txt
#osgx You found the cause! After echo never > /sys/kernel/mm/transparent_hugepage/enabled, 3.11.10 takes as long as 3.12.6!
Good news!
Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe) would be appreciated
I was wrong, this multi-heap section effect is not the brk randomisation (which changes only beginning of the heap). This is failure of VMA merging in do_brk; don't know why, but some changes for VM_SOFTDIRTY were seen in mm between 3.11.10 - 3.12.6.
UPDATE2: Possible cause of not-merging VMA:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2580 do_brk in 3.11
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2577 do_brk in 3.12
3.12 just added at the end of do_brk
2663 vma->vm_flags |= VM_SOFTDIRTY;
2664 return addr;
And bit above we have
2635 /* Can we just expand an old private anonymous mapping? */
2636 vma = vma_merge(mm, prev, addr, addr + len, flags,
2637 NULL, NULL, pgoff, NULL);
and inside vma_merge there is test for vm_flags
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L994 3.11
http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L994 3.12
1004 /*
1005 * We later require that vma->vm_flags == vm_flags,
1006 * so this tests vma->vm_flags & VM_SPECIAL, too.
1007 */
vma_merge --> can_vma_merge_before --> is_mergeable_vma ...
898 if (vma->vm_flags ^ vm_flags)
899 return 0;
But at time of check, new vma is not marked as VM_SOFTDIRTY, while old is already marked.
This change could be a likely candidate http://marc.info/?l=linux-kernel&m=138012715018064. I say this loosely as I don't have the resources to confirm. Its worth noting that this was the only significant change to the scheduler between 3.11.10 and 3.12.6.
Anyhow I'm very interested to see the end results of your findings so keep us posted.

Linux Multi-Threaded Performance Enhancements for File open()

I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We’ve seemed to hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
loads a list of 1000 text files of ~8K size;
creates the threads (simply) with no attribute modifications;
each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the file list is exhausted in such a way that a single thread should open all 1000 files, 2 threads should theoretically open 500 files (not proven as of yet), etc.);
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds).
Running the app using a thread count > #CPUs results in run times that seem to level out at the approx same value as the time is takes to run with #CPUs threads (is this coincidental, or a systematic limit, or what?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count.
Running app on different servers shows similar results
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
With increasing the thread count, a LOW timing inflection point occurs at around 4-5 active threads, then increases again until the number of threads equals the CPU count, then levels off again;
Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).
Also, we noticed that Local opens take about .01 ms and sequential network opens are 100x slower at 1ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network roundtrip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads;
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call to “open()” work in a multi-threading process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait until its turn in round-robin way;
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8-CPUS, each with characteristics as shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 # 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in linux in a clean way, without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity, at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything, it's mostly just to get a handle on a file to do some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which was simply proved by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.
Thanks to 64 bit memory addressing, using mmap() to map an entire disk to memory is no problem at all. (on 32 bit platforms, you would need to map the parts of the disk you need using mmap64())

Resources