I have tried to evaluate an OpenMPI program with Matrix Multiplication algorithm, the written code scales very well on a single thread per core machine in our Laboratory (close to ideal speedup within 48 and 64 cores), However, on some other machines which are hyperthreaded there is strange behavior, as you can see in the screenshot from htop I realized the CPU utilization when I run the same experiment with the same command is different and strange, I executed the program with
mpirun --bind-to hwthread--use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to each hwthread, and can be seen with -n 2 which means I overwrite the variable in such a way to bind the execution on two processors (here hwthreads), however, seems it uses another hwthread with more or less 50% of utilization as well! I found this strange because there is not any extra CPU utilization on other machines, I tried this experiment many times and I'm sure this is not a temporary check or sth by OS and is due to the execution model of OpenMPI.
I appreciate it if someone could explain this behavior and extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe Hyperthreading is not the case and I was misled by this, but the only big difference between these environments are 1) the Hyperthreading and 2) Clock Frequency of the processors which is based on different CPUs is different between 2200 MHz to 4.8 GHz.
I am new to MPI. I am using MPICH2 on a Linux machine with the following information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU # 2.20GHz
Stepping: 4
CPU MHz: 799.844
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My understanding is that I've got 2 nodes, 20 cores and 40 threads (i.e. processors) on this machine. Is this correct? If yes, I think I should set MPICH to spawn 20 processes (one process on each physical core), right? However, when I run the command mpiexec -n 20 MyProgram, the average CPU usage is only 50%. If I change to mpiexec -n 40 MyProgram, the CPU usage is 100% but the overall performance is actually becoming worse so I think I might be over-specifying.
CPU usage is a misleading metric. CPU usage reflects the portion of time some task was scheduled on a logical CPU. CPU average is just that, the average over all logical cores. So 50% CPU average can just mean that every other logical CPU has 100% usage, (and the others 0 %). So you observe this in a situation where each physical core is always utilized.
CPU usage, does mean resource utilization. There are workloads that benefit from using hyperthreading and workloads that don't. There are workloads that can be faster using less threads than physical cores (e.g. memory bandwidth limited). There are workloads that can be faster using more threads than logical CPUs (e.g. I/O latency limited).
Always use your performance metric (e.g. time) to figure out the best configuration. If you want to understand resource utilization you must look at many different performance metrics, cycles, instructions, memory bandwidth, cache, ....
For a Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz model cpu I am getting both the NumberOfCores and NumberOfLogicalProcessors as 4 .
I want to know how system is calculating NumberOfLogicalProcessors ?
What i should use to get the actual number of cpus ?
OS: win2k8 R2
That depends on what you meant by actual CPU.
Win32_Processor\NumberOfCores specifies the total number of physical CPU cores. A chip can contain one or more CPU cores.
Win32_Processor\NumberOfLogicalProcessors specifies the total number of virtual CPU cores. There can be two or more virtual CPU cores in one physical CPU core. On x86 compatible computers, this is only available in Intel's Hyper-Threading capable CPUs.
On the other hand, the Win32_ComputerSystem\NumberOfProcessors specifies the total number of physical processor chips installed on a multi processor motherboard.
The Win32_ComputerSystem\NumberOfLogicalProcessors is same as Win32_Processor\NumberOfLogicalProcessors.
My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3) while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version anyway.
Does anyone have any guesses as to why this might be happening?
Measurements:
Plugged in (full CPU):
Affinity mask: 0x3 (0011b), 9 gflop/s
Affinity mask: 0x5 (0101b), 17 gflop/s
Affinity mask: 0x6 (0110b), 17 gflop/s
Affinity mask: 0x9 (1001b), 9 gflop/s
Affinity mask: 0xA (1010b), 9 gflop/s
Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
Affinity mask: 0x3 (0011b), 5 gflop/s
Affinity mask: 0x5 (0101b), 10 gflop/s
Affinity mask: 0x6 (0110b), 10 gflop/s
Affinity mask: 0x9 (1001b), 5 gflop/s
Affinity mask: 0xA (1010b), 2 gflop/s
(--> Very interesting, why half speed when on battery but normal speed on AC?! this one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
Update:
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(
SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.
No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which have been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors which can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.
It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc) but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sound like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.
Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.
I bougth recently a server with 2 x X5550, they are quad (4 cores each) total 8 cores
If I check the task manager it shows in the CPU usage history 16 diagrams,
Should't it be 8 cause I have 2 processors with quad?
or the diagrams maybee shows the Threads of the CPU?
The CPUs have support for HyperThreading, so each core x2 logical CPUs.
You can always lookup the chip specs on Intel's site