Mitigating shared library bottlenecks while running many processes

Mitigating shared library bottlenecks while running many processes - performance

I'm doing some benchmarking on a 2 socket motherboard with 2 Intel 6230's (total of 40 cores). The computer is running RHEL-7.6 and utilizes NUMA. My ultimate goal is to determine the performance difference between using Intel's MKL library on an Intel vs. an AMD machine.
I installed python-3.7.3 using Anaconda. Looking at numpy's shared library:
ldd /home/user/local/python/3.7/lib/python3.7/site-packages/numpy/linalg/lapack_lite.cpython-37m-x86_64-linux-gnu.so
linux-vdso.so.1 => (0x00002aaaaaacc000)
libmkl_rt.so => /home/user/local/python/3.7/lib/python3.7/site-packages/numpy/linalg/../../../../libmkl_rt.so (0x00002aaaaaccf000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab3b6000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaab5d2000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab995000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
You can see that it depends on libmkl_rt.so. Presumably the linear algebra routines like np.dot() depend on this. So I run the following code, numpy_test.py :
import numpy as np
matrixSize = 5000 # outer dim of mats to run on
N = 50 # num of trials
np.random.seed(42)
Ax = matrixSize
Ay = 10000
Bx = 10000
By = matrixSize
A=np.random.rand(Ax,Ay)
B=np.random.rand(Bx,By)
npStartTime = time.time()
for i in range(N):
AB = np.dot(A,B)
print("Run time : {:.4f} s".format((time.time() - npStartTime)))
Running this with one core(wrong, see below) takes about 17.5 seconds. If I run it on all 40 cores simultaneously, the average run time is 1200s for each process. This answer attempts to provide a solution for mitigating this problem. Two of the possible solutions don't even work and the third option (dplace) doesn't seem to be easily accessible for RHEL 7.6.
Question
Is it plausible that the huge performance hit when running 40 processes is due to all the processes competing for access to the shared library (presumably libmkl_rt.so) which only lives in one place in memory?
If true, are there modern solutions to force each core to use its own copy of a shared library? I can't seem to find a static version of libmkl_rt.so to build numpy on.
EDIT
Following the suggestion of Gennady.F.Intel, I ran :
$ export MKL_VERBOSE=1; python3 src/numpy_attempt.py
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x555555c71cc0,1,0x555555c71cc0,1) 2.58ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,5000,5000,10000,0x7fffffffc870,0x2aaad834b040,5000,0x2aaac05d2040,10000,0x7fffffffc878,0x2aaaf00c4040,5000) 370.98ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
.
.
So I think the contention for resources has more to do with the fact that each of my 40 instances is asking for 40 threads each for a total of 1600 threads. If I export MKL_NUM_THREADS=1 and run my 40 instances of numpy_test.py the average run time is ~440 seconds. Running a single instance of numpy_test.py on the machine takes 240s. I think the discrepancy is explained, but the questions are yet to be answered.

Related

MKL FFT performance on Intel Xeon 6248 - abrupt variations

I am working on an application which requires to Fourier Transform batches of 2-dimensional signals, stored using single-precision complex floats.
I wanted to test the idea of dissecting those signals into smaller ones and see whether I can improve the efficiency of my computation, considering that FLOPS in FFT operations grow in an O(Nlog(N)) fashion. Of course different signal sizes (in memory) may experience difference FLOPS/s performance, so in order to really see if this idea can work I made some experiments.
What I observed after doing the experiments was that performance was varying very abruptly when changing the signal size, jumping for example from 60 Gflops/s to 300 Gflops/s! I am wondering why is that the case.
I ran the experiments using:
Compiler: g++ 9.3.0 ( -Ofast )
Intel MKL 2020 (static linking)
MKL-threading: GNU
OpenMP environment:
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=20
Platform:
Intel Xeon Gold 6248
https://ark.intel.com/content/www/us/en/ark/products/192446/intel-xeon-gold-6248-processor-27-5m-cache-2-50-ghz.html
Profiling tool:
Score-P 6.0
Performance results:
To estimate the average FLOP rates I assume: # of Flops = Nbatch * 5*N*N*Log_2( N*N )
When using batches of 2D signals of size 201 x 201 elements (N = 201), the observed average performance was approximately 72 Gflops/s.
Then, I examined the performance using 2D signals with N = 101, 102, 103, 104 or 105. The performance results are shown on the figure below.
I also examined experiments with smaller size such as N = 51, 52, 53, 54 or 55. The results are again shown below.
An finally, for N = 26, 27, 28, 29 or 30.
I performed the experiments two times and the performance results are consistent! I really doubt it is noise... but again I feel is quite unrealistic to achieve 350 Gflops/s, or maybe not???
Has anyone experienced similar performance variations, or have some comments on this?

You can use FFT from either Intel MKL or Intel IPP(Intel® Integrated Performance Primitives
) libraries. So as mentioned earlier in the comments section, the article link which I have given helps to determine which library is best suited for your application.
If you are working on applications that are related to engineering, scientific and financial applications you can go with the Intel MKL library, and if you are working with imaging, vision, signal, security, and storage applications, the Intel IPP library helps in speed performance.
Intel® MKL is suitable for large problem sizes typical to FORTRAN and C/C++ high-performance computing above-mentioned applications for MKL.
Intel® IPP is specifically designed for smaller problem sizes including those used in multimedia, data processing, communications, and embedded C/C++ applications
For complete details please refer
https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-ipp-choosing-an-fft.html
https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top.html
https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top.html

How to explain poor performance on Xeon processors for a loop with both sequential copy and a scattered store?

I stumbled upon a peculiar performance issue when running the following c++ code on some Intel Xeon processors:
// array_a contains permutation of [0, n - 1]
// array_b and inverse are initialized arrays
for (int i = 0; i < n; ++i) {
array_b[i] = array_a[i];
inverse[array_b[i]] = i;
}
The first line of the loop sequentially copies array_a into array_b (very few cache misses expected). The second line computes the inverse of array_b (many cache misses expected since array_b is a random permutation). We may also split the code into two separate loops:
for (int i = 0; i < n; ++i)
array_b[i] = array_a[i];
for (int i = 0; i < n; ++i)
inverse[array_b[i]] = i;
I would have expected the two versions (single vs. dual loop) to perform almost identically on relatively modern hardware. However, it appears that some Xeon processors are incredibly slow when executing the single loop version.
Below you can see the wall time in nano-seconds divided by n when running the snippet on a range of different processors. For the purpose of testing, the code was compiled using GCC 7.5.0 with flags -O3 -funroll-loops -march=native on a system with a Xeon E5-4620v4. Then, the same binary was used on all systems, using numactl -m 0 -N 0 on systems with multiple NUMA domains.
The used code is available on github. The interesting stuff is in the file runner.cpp.
[EDIT:] The assembly is provided here.
[EDIT:] New results including AMD EPYC.
On the various i7 models, the results are mostly as expected. Using the single loop is only slightly slower than the dual loops. This also holds for the Xeon E3-1271v3, which is basically the same hardware as an i7-4790. The AMC EPYC 7452 performed best by far, with almost no difference between the single and dual loop implementation. However, on the Xeon E5-2690v4 and E5-4620v4 systems using the single loop is incredibly slow.
In previous tests I also observed this strange performance issue on Xeon E5-2640 and E5-2640v4 systems. In contrast to that, there was no performance issue on several AMD EPYC and Opteron systems, and also no issues on Intel i5 and i7 mobile processors.
My question to the CPU experts thus is: Why does Intel's most high-end product-line perform so poorly compared to other CPUs? I am by far no expert in CPU architectures, so your knowledge and thoughts are much appreciated!

Maybe this is related to avx-512 frequendy throttling on Intel processors. These instructions generate a lot of heat, and if used on some circunstances the processor reduces the working frequency.
Here are some benchmarks for OpenSSL that show the effect. There is a rant from Linus Torvalds on this topic.
If avx-512 instructions are generated with "-march=native" you may be suffering this effect. Try disabling avx-512 with:
gcc -mno-avx512f

A fast solution to obtain the best ARIMA model in R (function `auto.arima`)

I have a data series composed by 2775 elements:
mean(series)
[1] 21.24862
length(series)
[1] 2775
max(series)
[1] 81.22
min(series)
[1] 9.192
I would like to obtain the best ARIMA model by using function auto.arima of package forecast:
library(forecast)
fit=auto.arima(Netherlands,stepwise=F,approximation = F)
But I am having a big problem: RStudio is running for an hour and a half without results. (I developed an R code to perform these calculations, employed on a Windows machine equipped with a 2.80GHz Intel(R) Core(TM) i7 CPU and 16.0 GB RAM.) I suspect that this is due to the length of time series. A solution could be the parallelization? (But I don't know how apply it).
Anyway, suggestions to speed this code? Thanks!

The forecast package has many of its functions built with parallel processing in mind. One of the arguments of the auto.arima() function is 'parallel'.
According to the package documentation, "If [parallel = ] TRUE and stepwise = FALSE, then the specification search is done in parallel.This can give a significant speedup on mutlicore machines."
If parallel = TRUE, it will automatically select how many 'cores' to use (for a laptop or desktop, it is often the number of cores * 2. For example, I have 4 cores and each core has 2 processors = 8 'cores'). If you want to manually set the number of cores, also use the argument num.cores.
I'd recommend checking out the e-book written by Hyndman all about the package. It is like a time-series forecasting bible.

Why does some Ruby code run twice as fast on a 2.53GHz than on a 2.2GHz Core 2 Duo processor?

(This question attempts to find out why the running of a program can be different on different processors, so it is related to the performance aspect of programming.)
The following program will take 3.6 seconds to run on a Macbook that has 2.2GHz Core 2 Duo, and 1.8 seconds to run on a Macbook Pro that has 2.53GHz Core 2 Duo. Why is that?
That's a bit weird... why doubling the speed when the CPU is only 15% faster in clock speed? I double checked the CPU meter to make sure none of the 2 cores are in 100% usage (so as to see the CPU is not busy running something else). Could it be because one is Mac OS X Leopard and one is Mac OS X Snow Leopard (64 bit)? Both are running Ruby 1.9.2.
p RUBY_VERSION
p RUBY_DESCRIPTION if defined? RUBY_DESCRIPTION
n = 9_999_999
p n
t = 0; 1.upto(n) {|i| t += i if i%3==0 || i%5==0}; p t
The following are just output of the program:
On 2.2GHz Core 2 Duo: (Update: Macbook identifier: MacBook3,1, therefore probably is Intel Core 2 Duo (T7300/T7500))
$ time ruby 1.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]"
9999999
23333331666668
real 0m3.784s
user 0m3.751s
sys 0m0.021s
2.53GHz Intel Core 2 Duo: (Update: Macbook identifier: MacBookPro5,4, therefore probably is Intel Core 2 Duo Penryn with 3 MB on-chip L2 cache)
$ time ruby 1.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]"
9999999
23333331666668
real 0m1.893s
user 0m1.809s
sys 0m0.012s
Test run on Windows 7:
time_start = Time.now
p RUBY_VERSION
p RUBY_DESCRIPTION if defined? RUBY_DESCRIPTION
n = 9_999_999
p n
t = 0; 1.upto(n) {|i| t += i if i%3==0 || i%5==0}; p t
print "Took #{Time.now - time_start} seconds to run\n"
Intel Q6600 Quad Core 2.4GHz running Windows 7, 64-bit:
C:\> ruby try.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18) [i386-mingw32]"
9999999
23333331666668
Took 3.248186 seconds to run
Intel 920 i7 2.67GHz running Windows 7, 64-bit:
C:\> ruby try.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18) [i386-mingw32]"
9999999
23333331666668
Took 2.044117 seconds to run
It is also strange why an i7 with 2.67GHz is slower than a 2.53GHz Core 2 Duo.

I suspect that ruby is switching to a arbitrary-precision integer implementation
later on the 64 bit os.
Quoting the Fixnum ruby doc:
A Fixnum holds Integer values that can
be represented in a native machine
word (minus 1 bit). If any operation
on a Fixnum exceeds this range, the
value is automatically converted to a
Bignum.
Here, a native machine word is technically 64 bit, but the interpreter is compiled to run on 32 bit processors.

why doubling the speed when the CPU is only 15% faster in clock speed?
Quite simply because the performance of computers is determined not solely by CPU clock speed.
Other things to consider are:
CPU architectures, including e.g. the number of cores on a CPU, or the general ability to run multiple instructions in parallel
other clock speeds in the system (memory, FSB)
CPU cache sizes
installed memory chips (some are faster than others)
additionally installed hardware (might slow down the system through hardware interruptions)
different operating systems
32-bit vs. 64-bit systems
I'm sure there's a lot more things to add to the above list. I won't elaborate further on the point, but if anyone feels like it, please, feel free to add to the above list.

In out CI environment we have a lot of "pizza box" computers that are supposed to be identical. They have the same hardware, were installed at the same time, and should be generally identical. They're even placed in "thermally equivalent" locations. They're not identical, and the variation can be quite stunning.
The only conclusion I have come up with is different binnings of CPU will have different thresholds for thermal stepping; some of the "best" chips hold up better. I also suspect other "minor" hardware faults/variations to be playing a role here. Maybe the slow boxes have slightly different components that play less well together ?
There are tools out there that will show you if your CPU is throttling for thermal reasons.

I don't know much Ruby, but your code doesn't look to be multithreaded, if that's the case it's not going to take advantage of multiple cores. There can also be large differences between two CPU models. You have smaller process sizes, larger caches, better SIMD instructions sets, faster memory access, etc... Compiler & OS differences can cause large swings in performance between Windows & Linux, this can also be said for x86 vs x64. Plus Core i7s support HyperThreading which in some cases makes a single threaded app slower.
Just as an example, if that 2.2Ghz CPU is an Intel Core2 E4500 it has the following specs:
Clock: 2.2Ghz
L2 Cache: 2MB
FSB: 800MT/sec
Process Size: 65nm
vs a T9400 which is likely in your MacBook Pro
Clock: 2.53Ghz
L2 Cache: 6MB
FSB: 1066MT/sec
Process Size: 45nm
Plus you're running it on an x64 build of Darwin. All those things could definitely add up to inflating a trivial little script into executing much faster.

didn't read the code, But.. It is really hard to test on 2 different computers.
You need exactly the same os, same processes, same amount of memory.
If you change the processor family (i7, core2due, P4, P4-D) - the processor frequency says nothing on each processor abilities against another family. you can only compare in the same family (a newer processor might invest cycles in core management rather in computation for example)

GNU make: should the number of jobs equal the number of CPU cores in a system?

There seems to be some controversy on whether the number of jobs in GNU make is supposed to be equal to the number of cores, or if you can optimize the build time by adding one extra job that can be queued up while the others "work".
Is it better to use -j4 or -j5 on a quad core system?
Have you seen (or done) any benchmarking that supports one or the other?

I've run my home project on my 4-core with hyperthreading laptop and recorded the results. This is a fairly compiler-heavy project but it includes a unit test of 17.7 seconds at the end. The compiles are not very IO intensive; there is very much memory available and if not the rest is on a fast SSD.
1 job real 2m27.929s user 2m11.352s sys 0m11.964s
2 jobs real 1m22.901s user 2m13.800s sys 0m9.532s
3 jobs real 1m6.434s user 2m29.024s sys 0m10.532s
4 jobs real 0m59.847s user 2m50.336s sys 0m12.656s
5 jobs real 0m58.657s user 3m24.384s sys 0m14.112s
6 jobs real 0m57.100s user 3m51.776s sys 0m16.128s
7 jobs real 0m56.304s user 4m15.500s sys 0m16.992s
8 jobs real 0m53.513s user 4m38.456s sys 0m17.724s
9 jobs real 0m53.371s user 4m37.344s sys 0m17.676s
10 jobs real 0m53.350s user 4m37.384s sys 0m17.752s
11 jobs real 0m53.834s user 4m43.644s sys 0m18.568s
12 jobs real 0m52.187s user 4m32.400s sys 0m17.476s
13 jobs real 0m53.834s user 4m40.900s sys 0m17.660s
14 jobs real 0m53.901s user 4m37.076s sys 0m17.408s
15 jobs real 0m55.975s user 4m43.588s sys 0m18.504s
16 jobs real 0m53.764s user 4m40.856s sys 0m18.244s
inf jobs real 0m51.812s user 4m21.200s sys 0m16.812s
Basic results:
Scaling to the core count increases the performance nearly linearly. The real time went down from 2.5 minutes to 1.0 minute (2.5x as fast), but the time taken during compile went up from 2.11 to 2.50 minutes. The system noticed barely any additional load in this bit.
Scaling from the core count to the thread count increased the user load immensely, from 2.50 minutes to 4.38 minutes. This near doubling is most likely because the other compiler instances wanted to use the same CPU resources at the same time. The system is getting a bit more loaded with requests and task switching, causing it to go to 17.7 seconds of time used. The advantage is about 6.5 seconds on a compile time of 53.5 seconds, making for a 12% speedup.
Scaling from thread count to double thread count gave no significant speedup. The times at 12 and 15 are most likely statistical anomalies that you can disregard. The total time taken increases ever so slightly, as does the system time. Both are most likely due to increased task switching. There is no benefit to this.
My guess right now: If you do something else on your computer, use the core count. If you do not, use the thread count. Exceeding it shows no benefit. At some point they will become memory limited and collapse due to that, making the compiling much slower. The "inf" line was added at a much later date, giving me the suspicion that there was some thermal throttling for the 8+ jobs. This does show that for this project size there's no memory or throughput limit in effect. It's a small project though, given 8GB of memory to compile in.

I would say the best thing to do is benchmark it yourself on your particular environment and workload. Seems like there are too many variables (size/number of source files, available memory, disk caching, whether your source directory & system headers are located on different disks, etc.) for a one-size-fits-all answer.
My personal experience (on a 2-core MacBook Pro) is that -j2 is significantly faster than -j1, but beyond that (-j3, -j4 etc.) there's no measurable speedup. So for my environment "jobs == number of cores" seems to be a good answer. (YMMV)

I, personally, use make -j n where n is "number of cores" + 1.
I can't, however, give a scientific explanation: I've seen a lot of people using the same settings and they gave me pretty good results so far.
Anyway, you have to be careful because some make-chains simply aren't compatible with the --jobs option, and can lead to unexpected results. If you're experiencing strange dependency errors, just try to make without --jobs.

Both are not wrong. To be at peace with yourself and with author of software you're compiling (different multi-thread/single-thread restrictions apply at software level itself), I suggest you use:
make -j`nproc`
Notes: nproc is linux command that will return number of cores/threads(modern CPU) available on system. Placing it under ticks ` like above will pass the number to the make command.
Additional info: As someone mentioned, using all cores/threads to compile software can literally choke your box to near death (being unresponsive) and might even take longer than using less cores. As I seen one Slackware user here posted he had dual core CPU but still provided testing up to j 8, which stopped being different at j 2 (only 2 hardware cores that CPU can utilize). So, to avoid unresponsive box i suggest you run it like this:
make -j`nproc --ignore=2`
This will pass the output of nproc to make and subtract 2 cores from its result.

Ultimately, you'll have to do some benchmarks to determine the best number to use for your build, but remember that the CPU isn't the only resource that matters!
If you've got a build that relies heavily on the disk, for example, then spawning lots of jobs on a multicore system might actually be slower, as the disk will have to do extra work moving the disk head back and forth to serve all the different jobs (depending on lots of factors, like how well the OS handles the disk-cache, native command queuing support by the disk, etc.).
And then you've got "real" cores versus hyper-threading. You may or may not benefit from spawning jobs for each hyper-thread. Again, you'll have to benchmark to find out.
I can't say I've specifically tried #cores + 1, but on our systems (Intel i7 940, 4 hyperthreaded cores, lots of RAM, and VelociRaptor drives) and our build (large-scale C++ build that's alternately CPU and I/O bound) there is very little difference between -j4 and -j8. (It's maybe 15% better... but nowhere near twice as good.)
If I'm going away for lunch, I'll use -j8, but if I want to use my system for anything else while it's building, I'll use a lower number. :)

I just got an Athlon II X2 Regor proc with a Foxconn M/B and 4GB of G-Skill memory.
I put my 'cat /proc/cpuinfo' and 'free' at the end of this so others can see my specs. It's a dual core Athlon II x2 with 4GB of RAM.
uname -a on default slackware 14.0 kernel is 3.2.45.
I downloaded the next step kernel source (linux-3.2.46) to /archive4;
extracted it (tar -xjvf linux-3.2.46.tar.bz2);
cd'd into the directory (cd linux-3.2.46);
and copied the default kernel's config over (cp /usr/src/linux/.config .);
used make oldconfig to prepare the 3.2.46 kernel config;
then ran make with various incantations of -jX.
I tested the timings of each run by issuing make after the time command, e.g.,
'time make -j2'. Between each run I 'rm -rf' the linux-3.2.46 tree and reextracted it, copied the default /usr/src/linux/.config into the directory, ran make oldconfig and then did my 'make -jX' test again.
plain "make":
real 51m47.510s
user 47m52.228s
sys 3m44.985s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j2
real 27m3.194s
user 48m5.135s
sys 3m39.431s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j3
real 27m30.203s
user 48m43.821s
sys 3m42.309s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j4
real 27m32.023s
user 49m18.328s
sys 3m43.765s
bob#Moses:/archive4/linux-3.2.46$
as above but with make -j8
real 28m28.112s
user 50m34.445s
sys 3m49.877s
bob#Moses:/archive4/linux-3.2.46$
'cat /proc/cpuinfo' yields:
bob#Moses:/archive4$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.91
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.94
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
'free' yields:
bob#Moses:/archive4$ free
total used free shared buffers cached
Mem: 3991304 3834564 156740 0 519220 2515308

Just as a ref:
From Spawning Multiple Build Jobs section in LKD:
where n is the number of jobs to spawn. Usual practice is to spawn one or two jobs per processor. For example, on a dual processor machine, one might do
$ make j4

From my experience, there must be some performance benefits when adding extra jobs.
It is simply because disk I/O is one of the bottle necks besides CPU. However it is not easy to decide on the number of extra jobs as it is highly inter-connected with the number of cores and types of the disk being used.

Many years later, the majority of these answers are still correct. However, there has been a bit of a change: Using more jobs than you have physical cores now gives a genuinely significant speedup. As an addendum to Dascandy's table, here's my times for compiling a project on a AMD Ryzen 5 3600X on linux. (The Powder Toy, commit c6f653ac3cef03acfbc44e8f29f11e1b301f1ca2)
I recommend checking yourself, but I've found with input from others that using your logical core count for job count works well on Zen. Alongside that, the system does not seem to lose responsiveness. I imagine this applies to recent Intel CPUs as well. Do note I have an SSD, as well, so it may be worth it to test your CPU yourself.
scons -j1 --release --native 120.68s user 9.78s system 99% cpu 2:10.60 total
scons -j2 --release --native 122.96s user 9.59s system 197% cpu 1:07.15 total
scons -j3 --release --native 125.62s user 9.75s system 292% cpu 46.291 total
scons -j4 --release --native 128.26s user 10.41s system 385% cpu 35.971 total
scons -j5 --release --native 133.73s user 10.33s system 476% cpu 30.241 total
scons -j6 --release --native 144.10s user 11.24s system 564% cpu 27.510 total
scons -j7 --release --native 153.64s user 11.61s system 653% cpu 25.297 total
scons -j8 --release --native 161.91s user 12.04s system 742% cpu 23.440 total
scons -j9 --release --native 169.09s user 12.38s system 827% cpu 21.923 total
scons -j10 --release --native 176.63s user 12.70s system 910% cpu 20.788 total
scons -j11 --release --native 184.57s user 13.18s system 989% cpu 19.976 total
scons -j12 --release --native 192.13s user 14.33s system 1055% cpu 19.553 total
scons -j13 --release --native 193.27s user 14.01s system 1052% cpu 19.698 total
scons -j14 --release --native 193.62s user 13.85s system 1076% cpu 19.270 total
scons -j15 --release --native 195.20s user 13.53s system 1056% cpu 19.755 total
scons -j16 --release --native 195.11s user 13.81s system 1060% cpu 19.692 total
( -jinf test not included, as it is not supported by scons.)
Tests done on Ubuntu 19.10 w/ a Ryzen 5 3600X, Samsung 860 Evo SSD (SATA), and 32GB RAM
Final note: Other people with a 3600X may get better times than me. When doing this test, I had Eco mode enabled, reducing the CPU's speed a little.

YES! On my 3950x, I run -j32 and it saves hours of compile time! I can still watch youtube, browse the web, etc. during compile without any difference. The processor isn't always pegged even with a 1TB 970 PRO nvme or 1TB Auros Gen4 nvme and 64GB of 3200C14. Even when it is, I don't notice UI wise. I plan on testing with -j48 in the near future on some big upcoming projects. I expect, as you probably do, to see some impressive improvement. Those still with a quad-core might not get the same gains....
Linus himself just upgraded to a 3970x and you can bet your bottom dollar, he is at least running -j64.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio