Are not all processors created equal? - windows

My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3) while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version anyway.
Does anyone have any guesses as to why this might be happening?
Measurements:
Plugged in (full CPU):
Affinity mask: 0x3 (0011b), 9 gflop/s
Affinity mask: 0x5 (0101b), 17 gflop/s
Affinity mask: 0x6 (0110b), 17 gflop/s
Affinity mask: 0x9 (1001b), 9 gflop/s
Affinity mask: 0xA (1010b), 9 gflop/s
Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
Affinity mask: 0x3 (0011b), 5 gflop/s
Affinity mask: 0x5 (0101b), 10 gflop/s
Affinity mask: 0x6 (0110b), 10 gflop/s
Affinity mask: 0x9 (1001b), 5 gflop/s
Affinity mask: 0xA (1010b), 2 gflop/s
(--> Very interesting, why half speed when on battery but normal speed on AC?! this one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
Update:
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(

SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.

No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which have been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors which can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.

It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc) but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sound like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.

Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.

Related

Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings

I'm confused about how multiple launches of same python command bind to cores on a NUMA Xeon machine.
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyperthreaded HT machine (lscpu output below) it'd limit the this numactl process to 4 threads.
But since machine has HT, it's not clear to me if 4-7 in the above are 4 physical or 4 logical.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
Where is OpenMP library getting invoked in binding threads to physical cores?
(from my limited understanding I could just launch command python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS still applies, even though I didn't explicitly use MPI lib anywhere, is that correct?)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 9242 CPU # 2.30GHz
Stepping: 7
Frequency boost: enabled
CPU MHz: 1000.026
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4600.00
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 96 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.
numactl do not launch threads. It controls NUMA policy for processes or shared memory. However, OpenMP runtimes may adapt the number of threads created by a region based on the environment set by numactl (although AFAIK this behaviour is undefined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration using the environment variable OMP_DISPLAY_ENV.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
This is a bit complex. Physical IDs are the ones available in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually. e.g. a BIOS update or kernel update might lead to enumerating logical cores in a different order.
You can use the great tool hwloc to convert well-defined deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true here for your processor, where it looks like the kernel enumerated one logical core from each physical core as cores 0..95, then 96..191 for the other logical core on each physical core). The other common possibility is for Linux to do both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
--physcpubind of numctl accepts physical cpu numbers as shown in the "processor" fields of /proc/cpuinfo regarding the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two threads IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
Where is OpenMP library getting invoked in binding threads to physical cores?
AFAIK, this is implementation dependent of the OpenMP runtime used (eg. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while some other use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom user-defined method and not the default one.
If you want to be safe and portable, use the following configuration (using Bash):
OMP_NUM_THREADS=4
OMP_PROC_BIND=TRUE
OMP_PLACES={$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}
The OpenMP threads will be scheduled on OpenMP places. The above configuration configure the OpenMP runtime so that there will be 4 threads statically map on 4 different fixed cores.

Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?

I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.
In the Intel optimization guide, there's a picture describing the "super-scalar ports" of the Intel Cores.
Here's the PDF. The picture is on page 40.
.
Here's another picture from page 78, this picture describes "Store Address" and "Store Data":
Prepares the store forwarding and store retirement logic with the address of the data being stored.
Prepares the store forwarding and store retirement logic with the data being stored.
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.
It seems "natural" to me that store-forwarding would be done to the address of the data. But I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done. Are there any assembly / optimization experts out there that can help me understand exactly the difference between STD and STA is?
Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.
But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the x86 tag wiki)
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle
Ports 2 and 3 can also run load uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.
Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.
On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.
Haswell added the dedicated store-AGU on port7 and widened the load/store execution units to 256b, because there aren't spare cycles when the load ports don't need their AGUs if there's a steady supply of loads.
A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready lets later loads (in program order) detect whether they overlap the store or not.
Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).
As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.
Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.
That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.
I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.
A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.
How does store to load forwarding happens in case of unaligned memory access?
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
Store/reload can happen when a function spills some registers before calling a function, of as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or passing something by reference to a non-inline function. Or in a histogram, if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
Its been a few days without a response, so here's my best guess at "answering my own question".
The raw x86 instruction set isn't executed directly by modern processors. Instead, the x86 instruction set is "compiled" down into Micro-ops (uOps) before being executed by the Intel core. This shouldn't be too surprising, because some x86 instructions can be complex. An example taken from the optimization guide is as follows:
Similarly, the following store instruction has three register sources and is broken into "generate store
address" and "generate store data" sub-components.
MOV [ESP+ECX*4+12345678], AL
This is currently found on page 50 of the optimization manual (2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)).
In this case, the address of the store operation is complex, so it is its own uOp. So at very least, this singular x86 instruction gets converted into two uOps internally. The names of these two uOps are "Store Address" and "Store Data". The manual doesn't describe the internal uOps at all, so it may take even more than two uOps to accomplish.
Since there's only one "store data" port on Skylake systems, that means that Skylake can only modify at most one memory location per cycle. The three "Store Address" ports means that Skylake can calculate the effective address of many instructions simultaneously (possibly because some very complicated addresses may take more than one uOp to execute??).

why is mpi slower on my laptop

I am running MPI on my laptop (intel i7 quad core 4700m 12Gb RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought that it should scale well up to 8 process (intel quad core simulates as 8???). For example consider the simple toy Fortran code:
program test
implicit none
integer, parameter :: root=0
integer :: ierr,rank,nproc,tt,i
integer :: n=100000
real :: s=0.0,tstart,tend
complex, dimension(100000/nproc) :: u=2.0,v=0.0
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
call cpu_time(tstart)
do tt=1,200000
v=0.0
do i=1,100000/nproc
v(i) = v(i) + 0.1*u(i)
enddo
enddo
call cpu_time(tend)
if (rank==root) then
print *, 'total time was: ',tend-tstart
endif
call MPI_FINALIZE(ierr)
end subroutine test2
For 2 processes it takes half the time, but even trying 4 processes (should be quarter of the time?) the result begins to become less efficient and for 8 processes there is no improvement whatsoever. Basically I am wondering if this is just because I am running on a laptop and has something to do with shared memory, or if I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.
A quad core processor, thanks to hyperthreading shows itself as having 8 threads, but physically they are just 4 cores. The other 4 are scheduled by the hardware itself using the free slots in the execution pipelines.
It happens that especially with compute intensive loads this approach does not pay at all, being often counter-productive too on extreme loads because of overheads and not always optimized cache usage.
You can try to disable hyperthreading in the BIOS and compare it: you will have just 4 threads, 4 cores.
Even going from 1 to 4 there are resources that are being in competition. In particular each core has its own L1 cache, but each pair of cores shares the L2 cache (2x256KB) and the 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect to have linear scaling occupying more and more cores, since they will have to balance the usage of the resources, that are dedicated to one core/one thread in the sequential case.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads, as the one in your test case.
For example it's less evident with matrix-matrix multiplies, that is compute-intensive: for a NxN matrix, you have O(N^2) memory accesses but O(N^3) floating point operations.

Task Manager: CPU usage history

I bougth recently a server with 2 x X5550, they are quad (4 cores each) total 8 cores
If I check the task manager it shows in the CPU usage history 16 diagrams,
Should't it be 8 cause I have 2 processors with quad?
or the diagrams maybee shows the Threads of the CPU?
The CPUs have support for HyperThreading, so each core x2 logical CPUs.
You can always lookup the chip specs on Intel's site

Mapping logical processors to physical processors

On a dual quad-core GetProcessAffinityMask (or the dialog from "Set affinity" in taskman.exe) will report eight logical processors. How do I find out which logical processor is on which physical processor? Especially: which logical processors are on the same physical processor?
EDIT: If it is not possible to do this programmatically, do anyone just know what the normal mapping is? Are the first four on the first processor and the second four on the second or are the odd numbered on the first and the even numbered on the second?
You can use Win32_Processor WMI class to query the number of cores, number of logical processors, architecture, cache memory and other information about the CPUs on the system.
To query information about the relationship between the logical processors in a system, you can use GetLogicalProcessorInformation API function.
In case you don't want to write the code yourself, SysInternal's handy coreinfo utility comes closest to answering your questions. It implements GetLogicalProcessorInformation as Mehrdad recommends. For a Xeon E5640 (quad core, 8 threads), you get from coreinfo:
c:\App\SysInternals>Coreinfo.exe -c
Coreinfo v3.0 - Dump information on system CPU and memory topology
Copyright (C) 2008-2011 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)
There are 8 * for the 8 hyperthreads, two per core, as expected for this chip. What's not clear, though, is how the arrangement of * matches up with the list of logical processors as Windows presents them. For instance, Task Manager gives me a dialog for assigning the processor affinity, labeled CPU 0 through CPU 7, for any process. It's fair (but not necessary) to assume that you can take coreinfo's output and number the logical processors left-to-right. So "CPU 5" would be the second hyperthread running on physical processor 2.
The numbering is done in a sequential manner: first all physical cores followed by the logical cores [1] .
[1] CPU Numbering on a hypertheading enabled system

Resources