Performance of my MPI code does not improve when I use two NUMA nodes (dual Xeon chips)

I have a computer, a Precision Tower 7810 with dual Xeon E5-2680 v3 @ 2.50GHz (48 threads in total).
Here is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3300,0000
CPU min MHz: 1200,0000
BogoMIPS: 4988.40
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
My MPI code uses only basic MPI calls (Isend, Irecv, Wait, Bcast). The data is distributed and sent to all processes. Each process uses its data to calculate something and updates the values. After that, the data held by each process is exchanged among all processes. This is repeated up to an iteration limit.
The main issue is that performance improves as I increase the number of processes up to the limit of one chip (24 threads). However, performance does not improve once the number of processes exceeds 24.
An example:
$mpiexec -n 6 ./mywork : 72s
$mpiexec -n 12 ./mywork : 46s
$mpiexec -n 24 ./mywork : 36s
$mpiexec -n 32 ./mywork : 36s
$mpiexec -n 48 ./mywork : 35s
I have tried both OpenMPI and MPICH, and the results are the same. So I think the issue is related to the physical connection between the two chips (the NUMA nodes). This is only my assumption; I have never used a real supercomputer. I hope someone knows about this issue and can help me. Thank you for reading.
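To make the scaling behaviour easier to read, the timings listed above can be turned into speedup and parallel efficiency. This is only a sketch: `scaling_table` is a hypothetical helper name, and the 6-process run is used as the baseline because no 1-process timing is given.

```python
# Wall-clock times copied from the mpiexec runs above: process count -> seconds.
timings = {6: 72.0, 12: 46.0, 24: 36.0, 32: 36.0, 48: 35.0}


def scaling_table(timings):
    """Return (n, speedup, efficiency) rows relative to the smallest run.

    speedup    = baseline time / time at n processes
    efficiency = speedup / (n / baseline process count)
    """
    base_n = min(timings)
    base_t = timings[base_n]
    rows = []
    for n in sorted(timings):
        speedup = base_t / timings[n]
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


for n, speedup, efficiency in scaling_table(timings):
    print(f"n={n:2d}  speedup={speedup:.2f}  efficiency={efficiency:.2f}")
```

Relative to the 6-process run, going to 24 processes (4x the processes) gives a speedup of 2.0, so the efficiency has already dropped to 0.5 before the second socket comes into play.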


Discrepancy in output of lscpu

I have a question about the performance impact when two boxes of the same spec show different results:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping: 0
CPU MHz: 2999.999 <=============
BogoMIPS: 5999.99 <=============
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping: 0
CPU MHz: 3000.00 <=============
BogoMIPS: 6000.00 <=============
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
For the same application running on both nodes, I see the load average on Box1 being 2x to 3x higher than on Box2.
The only difference I see is that the CPU MHz and BogoMIPS numbers in the lscpu output are off by a fraction.
Why do we see such a difference in the reported CPU values, and will there be a perf difference because of this?
The only difference is that bogomips calibration randomly got a slightly lower value, e.g. a crystal oscillator might be off by a fraction of a percent, or just a pure timing artifact between the CPU core clock vs. whatever clocks Linux uses to time the bogomips loop.
So that doesn't explain anything, and is unrelated to any significant software performance difference you observe. Obviously we can't tell you anything more without any details.
Possible guesses at an explanation for a big perf difference could include the RAM config, like do they both have all memory controllers populated?
Otherwise almost certainly some software difference. Like one running a debug build, or some difference in how you built the binary, or in the libraries it uses, or the kernel or kernel config. Or running on different data, and your application is sensitive to different data.
Or possibly if your VMware config has one of those VMs mapping its CPU cores to fewer physical cores on the bare metal, e.g. competing via hyperthreading when the kernel in the VM assumes they're not. Or if the guest kernel has wrong info about NUMA.
Or, obviously, if your VM is sharing the bare metal on one of them with some other workload!
There can be minor differences in inter-core latency between different instances of the same Xeon CPU model, depending on exactly where on the ring bus the enabled cores are. (Except on the top-end models for each core-count, some of the cores on each die are fused off due to defects or just for market segmentation.) But this is a very small effect, and only in inter-core latency.
But this is all kind of off-topic for your question about a .001 MHz difference in measured CPU frequency. We can safely say that's not an explanation. If you do want to ask about that, post a separate question with full details on your application. But probably it's going to be some difference only you can find, some wrong assumption about something being the same. Maybe run some other benchmarks on the machines, especially pre-compiled to rule out compiler differences.
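To put a number on how small the reported gap is, here is a quick sketch that computes the relative difference between the two lscpu readings quoted in the question. `relative_difference` is a hypothetical helper, not part of any tool.

```python
def relative_difference(a, b):
    """Relative difference between two readings, as a fraction of the larger."""
    return abs(a - b) / max(abs(a), abs(b))


# Values taken from the two lscpu outputs in the question.
mhz = relative_difference(2999.999, 3000.00)
bogomips = relative_difference(5999.99, 6000.00)

print(f"CPU MHz differs by a factor of {mhz:.2e}")
print(f"BogoMIPS differs by a factor of {bogomips:.2e}")
```

Both ratios are far below one part in a thousand, which is many orders of magnitude too small to account for a 2x to 3x difference in load average.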

Strange CPU binding/pinning result within OpenMPI

I have tried to evaluate an OpenMPI program that implements a matrix multiplication algorithm. The code scales very well on a machine in our laboratory with a single thread per core (close to ideal speedup with 48 and 64 cores). However, on some other machines that are hyperthreaded, there is strange behavior. As you can see in the screenshot from htop, the CPU utilization is different and strange when I run the same experiment with the same command. I executed the program with
mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to hwthreads, and -n 2 means the execution should be bound to two processors (here hwthreads). However, it seems to use another hwthread at roughly 50% utilization as well! I find this strange because there is no extra CPU utilization on the other machines. I tried this experiment many times, and I'm sure this is not a temporary check or something else done by the OS; it is due to the execution model of OpenMPI.
I would appreciate it if someone could explain this behavior and the extra CPU utilization when I execute this on the hyperthreaded machine.
The output of lscpu is as below:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe hyperthreading is not the cause and I was misled by it, but the only big differences between these environments are 1) hyperthreading and 2) the clock frequency of the processors, which ranges from 2200 MHz to 4.8 GHz across the different CPUs.

MPICH2 on a machine with two NUMA nodes

I am new to MPI. I am using MPICH2 on a Linux machine with the following information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 799.844
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My understanding is that I've got 2 nodes, 20 cores and 40 threads (i.e. processors) on this machine. Is this correct? If yes, I think I should set MPICH to spawn 20 processes (one process on each physical core), right? However, when I run the command mpiexec -n 20 MyProgram, the average CPU usage is only 50%. If I change to mpiexec -n 40 MyProgram, the CPU usage is 100% but the overall performance is actually becoming worse so I think I might be over-specifying.
CPU usage is a misleading metric. CPU usage reflects the portion of time some task was scheduled on a logical CPU. CPU average is just that, the average over all logical cores. So 50% CPU average can just mean that every other logical CPU has 100% usage, (and the others 0 %). So you observe this in a situation where each physical core is always utilized.
CPU usage does not mean resource utilization. There are workloads that benefit from using hyperthreading and workloads that don't. There are workloads that can be faster using fewer threads than physical cores (e.g. memory bandwidth limited). There are workloads that can be faster using more threads than logical CPUs (e.g. I/O latency limited).
Always use your performance metric (e.g. time) to figure out the best configuration. If you want to understand resource utilization you must look at many different performance metrics, cycles, instructions, memory bandwidth, cache, ....
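As a rough illustration of why 20 processes show up as 50% average usage, here is a sketch that uses the socket, core, and thread counts from the lscpu output in the question. It assumes each process keeps exactly one logical CPU 100% busy and that all other logical CPUs are idle; `average_cpu_usage` is a hypothetical helper.

```python
def average_cpu_usage(busy_logical_cpus, total_logical_cpus):
    """Average usage in percent, assuming each busy logical CPU is 100% loaded
    and every other logical CPU is idle."""
    if not 0 <= busy_logical_cpus <= total_logical_cpus:
        raise ValueError("busy count must be between 0 and the total")
    return 100.0 * busy_logical_cpus / total_logical_cpus


# Figures from the lscpu output in the question.
sockets, cores_per_socket, threads_per_core = 2, 10, 2
physical_cores = sockets * cores_per_socket          # 20
logical_cpus = physical_cores * threads_per_core     # 40

print(average_cpu_usage(physical_cores, logical_cpus))  # mpiexec -n 20
print(average_cpu_usage(logical_cpus, logical_cpus))    # mpiexec -n 40
```

Under these assumptions the first call gives 50.0 and the second 100.0, matching what the question reports, even though in the first case every physical core may already be fully occupied.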

Making sense of cpu info [closed]

I generally know that the more the number of processors the more processes (watching a movie, playing some game, running firefox with youtube playing a Simpson's episode, all simultaneously) you can have simultaneously going without your computer slowing down. But I want to know how to make sense of the linux commands cpuinfo and lscpu.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6800.18
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
and cpuinfo:
===== Processor composition =====
Processor name : Quad-Core AMD Opteron(tm) Processor 2354
Packages(sockets) : 2
Cores : 8
Processors(CPUs) : 8
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 0 1
5 0 1 1
6 0 2 1
7 0 3 1
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3 0,1,2,3
1 0,1,2,3 4,5,6,7
What exactly are they telling me? A dual core to me means two cores per processor. I can see 8 CPU(s) listed. But what is the difference between threads and cores? I can see 2 Thread(s) per core. And what is a socket? I could not google a place where things are explained, but there are plenty of places which tell you to use cpuinfo/lscpu.
What you call "core" is technically a "physical core", aka socket aka package.
A physical core is "virtually split" into logical cores (listed simply as "core(s)" by cpuinfo/lscpu).
So your system has 2 physical cores, each one divided into 4 logical cores. This sums up into 8 logical cores.
A similar question on tomshw:
A socket is on the motherboard, where you plug the processor inside and have a fan cooling it.
cpuinfo on your machine says that you have a motherboard with 2 sockets and 2 processors, which are each a Quad-Core AMD Opteron(tm) Processor 2354. So together you have 8 cores (2x quad (4) core) and also 8 threads available.
You ran lscpu on a different machine, which has only one processor on the motherboard. This one is an Intel quad core with Hyper-Threading.
A socket is a physical plug on your motherboard. A core is a physical part of a computer, while a thread is a specific path of execution on a core. This answer explains threads really well.
lscpu -
cpuinfo -
EDIT: whoops, got network sockets mixed in there for some reason. Just kidding.
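The relationship between sockets, cores, and threads described in the answers above can be written down as a small sketch. The `Topology` class and the variable names are hypothetical; the numbers come from the cpuinfo and lscpu outputs in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    sockets: int            # physical packages plugged into the motherboard
    cores_per_socket: int   # physical cores inside each package
    threads_per_core: int   # hardware threads per core (2 with Hyper-Threading)

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def logical_cpus(self) -> int:
        # This is the "CPU(s)" figure reported by lscpu.
        return self.physical_cores * self.threads_per_core


# cpuinfo output: two quad-core Opterons, one thread per core.
opteron_box = Topology(sockets=2, cores_per_socket=4, threads_per_core=1)
# lscpu output: one quad-core Intel CPU with two threads per core.
intel_box = Topology(sockets=1, cores_per_socket=4, threads_per_core=2)

for name, box in (("opteron_box", opteron_box), ("intel_box", intel_box)):
    print(name, box.physical_cores, "physical cores,", box.logical_cpus, "logical CPUs")
```

Both machines report 8 logical CPUs, but the first gets there with 8 physical cores and the second with 4 physical cores running two threads each.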

How to compute the theoretical peak performance of CPU

Here is my cat /proc/cpuinfo output:
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping : 5
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic ...
bogomips : 4533.56
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management :
This machine has two CPUs, each with 4 cores with hyperthreading capability, so the total number of processors is 16 (2 CPUs * 4 cores * 2 hyperthreads). All processors show the same output, so to keep things clean I only show the last one's info and omit part of the flags line.
So how do I calculate the peak performance of this machine in terms of GFlops?
Let me know if more info should be supplied.
You can check the Intel export spec.
The GFLOP figure in the chart usually refers to the peak of a single chip.
It shows 36.256 Gflop/s for E5520.
This single chip has 4 physical cores with SSE.
So this GFLOP can also be calculated as:
2.26GHz*2(mul,add)*2(SIMD double precision)*4(physical core) = 36.2.
Your system has two CPUs, so your peak is 36.2*2 = 72.4 GFLOP/s.
You can find a formula on this website:
Here is the formula:
performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node).
So in your case: 2.27 x 4 x 4 x 2 = 72.64 GFLOP/s
see here for the configuration of your CPU
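The formula quoted above can be checked with a few lines of code. This is a minimal sketch: `peak_gflops` is a hypothetical helper, and the value of 4 floating-point operations per cycle follows the breakdown in the first answer (2 for mul and add, times 2 for double-precision SIMD lanes).

```python
def peak_gflops(clock_ghz, cores_per_cpu, flops_per_cycle, cpus_per_node):
    """Theoretical peak in GFLOP/s: clock x cores x flops per cycle x CPUs."""
    return clock_ghz * cores_per_cpu * flops_per_cycle * cpus_per_node


# E5520 figures from the question and answers: 2.27 GHz, 4 physical cores
# per CPU, 4 double-precision flops per cycle per core, 2 CPUs in the node.
per_chip = peak_gflops(2.27, 4, 4, 1)
per_node = peak_gflops(2.27, 4, 4, 2)

print(f"per chip: {per_chip:.2f} GFLOP/s")
print(f"per node: {per_node:.2f} GFLOP/s")
```

Note that only physical cores are counted. Hyperthreads share the floating-point units of their core, so they do not raise the theoretical peak.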
