Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings

Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings - parallel-processing

I'm confused about how multiple launches of same python command bind to cores on a NUMA Xeon machine.
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyperthreaded HT machine (lscpu output below) it'd limit the this numactl process to 4 threads.
But since machine has HT, it's not clear to me if 4-7 in the above are 4 physical or 4 logical.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
Where is OpenMP library getting invoked in binding threads to physical cores?
(from my limited understanding I could just launch command python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS still applies, even though I didn't explicitly use MPI lib anywhere, is that correct?)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 9242 CPU # 2.30GHz
Stepping: 7
Frequency boost: enabled
CPU MHz: 1000.026
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4600.00
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 96 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191

I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.
numactl do not launch threads. It controls NUMA policy for processes or shared memory. However, OpenMP runtimes may adapt the number of threads created by a region based on the environment set by numactl (although AFAIK this behaviour is undefined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration using the environment variable OMP_DISPLAY_ENV.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
This is a bit complex. Physical IDs are the ones available in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually. e.g. a BIOS update or kernel update might lead to enumerating logical cores in a different order.
You can use the great tool hwloc to convert well-defined deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true here for your processor, where it looks like the kernel enumerated one logical core from each physical core as cores 0..95, then 96..191 for the other logical core on each physical core). The other common possibility is for Linux to do both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
--physcpubind of numctl accepts physical cpu numbers as shown in the "processor" fields of /proc/cpuinfo regarding the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two threads IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
Where is OpenMP library getting invoked in binding threads to physical cores?
AFAIK, this is implementation dependent of the OpenMP runtime used (eg. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while some other use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom user-defined method and not the default one.
If you want to be safe and portable, use the following configuration (using Bash):
OMP_NUM_THREADS=4
OMP_PROC_BIND=TRUE
OMP_PLACES={$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}
The OpenMP threads will be scheduled on OpenMP places. The above configuration configure the OpenMP runtime so that there will be 4 threads statically map on 4 different fixed cores.

Related

strange CPU binding/pining result within OpenMPI

I have tried to evaluate an OpenMPI program with Matrix Multiplication algorithm, the written code scales very well on a single thread per core machine in our Laboratory (close to ideal speedup within 48 and 64 cores), However, on some other machines which are hyperthreaded there is strange behavior, as you can see in the screenshot from htop I realized the CPU utilization when I run the same experiment with the same command is different and strange, I executed the program with
mpirun --bind-to hwthread--use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to each hwthread, and can be seen with -n 2 which means I overwrite the variable in such a way to bind the execution on two processors (here hwthreads), however, seems it uses another hwthread with more or less 50% of utilization as well! I found this strange because there is not any extra CPU utilization on other machines, I tried this experiment many times and I'm sure this is not a temporary check or sth by OS and is due to the execution model of OpenMPI.
I appreciate it if someone could explain this behavior and extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe Hyperthreading is not the case and I was misled by this, but the only big difference between these environments are 1) the Hyperthreading and 2) Clock Frequency of the processors which is based on different CPUs is different between 2200 MHz to 4.8 GHz.

MPICH2 on a machine with two NUMA nodes

I am new to MPI. I am using MPICH2 on a Linux machine with the following information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU # 2.20GHz
Stepping: 4
CPU MHz: 799.844
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My understanding is that I've got 2 nodes, 20 cores and 40 threads (i.e. processors) on this machine. Is this correct? If yes, I think I should set MPICH to spawn 20 processes (one process on each physical core), right? However, when I run the command mpiexec -n 20 MyProgram, the average CPU usage is only 50%. If I change to mpiexec -n 40 MyProgram, the CPU usage is 100% but the overall performance is actually becoming worse so I think I might be over-specifying.

CPU usage is a misleading metric. CPU usage reflects the portion of time some task was scheduled on a logical CPU. CPU average is just that, the average over all logical cores. So 50% CPU average can just mean that every other logical CPU has 100% usage, (and the others 0 %). So you observe this in a situation where each physical core is always utilized.
CPU usage, does mean resource utilization. There are workloads that benefit from using hyperthreading and workloads that don't. There are workloads that can be faster using less threads than physical cores (e.g. memory bandwidth limited). There are workloads that can be faster using more threads than logical CPUs (e.g. I/O latency limited).
Always use your performance metric (e.g. time) to figure out the best configuration. If you want to understand resource utilization you must look at many different performance metrics, cycles, instructions, memory bandwidth, cache, ....

How system is calculating to get the NumberOfLogicalProcessors in VB

For a Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz model cpu I am getting both the NumberOfCores and NumberOfLogicalProcessors as 4 .
I want to know how system is calculating NumberOfLogicalProcessors ?
What i should use to get the actual number of cpus ?
OS: win2k8 R2

That depends on what you meant by actual CPU.
Win32_Processor\NumberOfCores specifies the total number of physical CPU cores. A chip can contain one or more CPU cores.
Win32_Processor\NumberOfLogicalProcessors specifies the total number of virtual CPU cores. There can be two or more virtual CPU cores in one physical CPU core. On x86 compatible computers, this is only available in Intel's Hyper-Threading capable CPUs.
On the other hand, the Win32_ComputerSystem\NumberOfProcessors specifies the total number of physical processor chips installed on a multi processor motherboard.
The Win32_ComputerSystem\NumberOfLogicalProcessors is same as Win32_Processor\NumberOfLogicalProcessors.

Are not all processors created equal?

My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3) while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version anyway.
Does anyone have any guesses as to why this might be happening?
Measurements:
Plugged in (full CPU):
Affinity mask: 0x3 (0011b), 9 gflop/s
Affinity mask: 0x5 (0101b), 17 gflop/s
Affinity mask: 0x6 (0110b), 17 gflop/s
Affinity mask: 0x9 (1001b), 9 gflop/s
Affinity mask: 0xA (1010b), 9 gflop/s
Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
Affinity mask: 0x3 (0011b), 5 gflop/s
Affinity mask: 0x5 (0101b), 10 gflop/s
Affinity mask: 0x6 (0110b), 10 gflop/s
Affinity mask: 0x9 (1001b), 5 gflop/s
Affinity mask: 0xA (1010b), 2 gflop/s
(--> Very interesting, why half speed when on battery but normal speed on AC?! this one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
Update:
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(

SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.

No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which have been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors which can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.

It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc) but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sound like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.

Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.

Task Manager: CPU usage history

I bougth recently a server with 2 x X5550, they are quad (4 cores each) total 8 cores
If I check the task manager it shows in the CPU usage history 16 diagrams,
Should't it be 8 cause I have 2 processors with quad?
or the diagrams maybee shows the Threads of the CPU?

The CPUs have support for HyperThreading, so each core x2 logical CPUs.
You can always lookup the chip specs on Intel's site

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio