How to disable cache prefetching on AMD family 17h CPUs?

Is there a way to disable cache prefetching on AMD family 17h CPUs?
On family 10h CPUs, this is possible by setting a bit in MSRC001_1022 (programmatically disable hardware prefetching on AMD systems). This MSR is not available on family 17h CPUs (https://developer.amd.com/wp-content/resources/56255_3_03.PDF).
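For reference, the family 10h method looks something like the sketch below, using Linux's /dev/cpu/*/msr interface (msr module loaded, run as root). The bit position used here is only a placeholder assumption; check the DC_CFG description in the BKDG for your exact family and model before relying on it.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: set a prefetch-disable bit in MSRC001_1022 (DC_CFG) on CPU 0 via
 * Linux's /dev/cpu/0/msr (requires the msr module and root).
 * NOTE: the bit position below is a placeholder assumption -- look up the
 * actual DC_CFG prefetch-disable bit(s) in the BKDG for your CPU family. */
int main(void)
{
    const uint32_t msr = 0xC0011022;            /* DC_CFG */
    const uint64_t disable_bit = 1ULL << 13;    /* placeholder, verify in BKDG */
    uint64_t val;

    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    if (pread(fd, &val, sizeof val, msr) != sizeof val) { perror("rdmsr"); return 1; }
    val |= disable_bit;
    if (pwrite(fd, &val, sizeof val, msr) != sizeof val) { perror("wrmsr"); return 1; }

    close(fd);
    return 0;
}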

Related

Which mobile Windows devices don't support AVX2

I understand that Intel's AVX2 extension has been on the market since 2011 and is therefore pretty much standard in modern devices.
However, for some decision-making we need to find out, roughly, the share of existing mobile Windows devices which don't support AVX2 (nor its successor, AVX-512).
It is rather well documented which CPUs, Intel and AMD, actually support the extension, so that is not what I am asking for.
How do I find which mobile Windows devices exist on the market, including from recent years, that have processors which don't yet support the AVX2 instruction set?
You're incorrect about the dates, and about it being "pretty much standard", unfortunately. It could have been standard by now if Intel hadn't disabled it for market-segmentation reasons in their low-end CPUs. (To be slightly fair, that may have let them sell chips with defects in one half of a 256-bit execution unit, improving yields.)
All AMD CPUs aimed at mobile/laptop use (not Geode), including low-power CPUs since Jaguar, have had AVX since Bulldozer. Their low-power CPUs decode 256-bit instructions to two 128-bit uops, same as Bulldozer-family and Zen 1 did. (Which meant it wasn't always worth using in Bulldozer-family, but it wasn't a lot slower than carefully-tuned SSE, sometimes still faster, and it meant software had that useful baseline. And 128-bit AVX instructions are great everywhere, often saving instructions by being 3-operand.) Intel used the same decode-into-two-halves strategy in Gracemont, the E-cores for Alder Lake, like they did for SSE in P6 CPUs before Core 2, such as Pentium III and Pentium M.
AVX was new in Sandy Bridge (2011) and Bulldozer (2011), AVX2 was new in Haswell (2013) and Excavator (2015).
Pentium/Celeron versions of Skylake / Coffee Lake etc. (lower end than i3) have AVX disabled, along with AVX2/FMA/BMI1/2. BMI1 and 2 include some instructions that use VEX encodings on general-purpose integer registers, which seems to indicate that Intel disables decoding of VEX prefixes entirely as part of binning a silicon chip for use in low-end SKUs.
The first Pentium/Celeron CPUs with AVX1/2/FMA are Ice Lake / Tiger Lake based. There are currently Alder Lake based Pentiums with AVX2, like 2c4t (2 P cores) Pentium Gold G7400 and Pentium Gold 8505 (mobile 1 P core, 4 E cores). So 7xxx and 8xxx and higher Pentiums should have AVX1 / AVX2 / FMA, but earlier ones mostly not. One of the latest without AVX is desktop Pentium Gold G6405, 2c4t Comet Lake launched in Q1 2021. (The mobile version, 6405U, launched in Q4'19). There's also an "Amber Lake Y" Pentium Gold 6500Y with AVX2, launched Q1'21.
Low-power CPUs in the Silvermont family (up to 2019's Tremont) don't support AVX at all.
These are common in "netbook" and low budget laptops, as well as low-power servers / NAS. (The successor, Gracemont, has AVX1/2/FMA, so it can work as the E-cores in Alder Lake.)
These CPUs are badged as Pentium J-series and N-series. For example, the Intel Pentium Processor N6415 launched in 2021, with 4 cores, aimed at "PC/Client/Tablet" use cases. These are Elkhart Lake (Tremont cores), with only SSE4.2.
The "Atom" brand name is still used on server versions of these, including chips with 20 cores.

Are caches of different levels operating in the same frequency domain?

Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.
So, do L2 caches work in the same domain as L1 caches? How about L3 cache (slices), since they are now non-inclusive and shared among all the cores?
And related questions are:
Are all function units in a core in the same clock domain?
Are the uncore part all in the same clock domain?
Are cores in the multi-core system synchronous?
I believe clock domain crossing would introduce extra latency. Do most parts of a CPU chip work in the same clock domain?
The private L1i/d caches are always part of each core, not on a separate clock, in modern CPUs (footnote 1). L1d is very tightly coupled with load execution units, and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).
On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).
Or for example early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in PPro). (But all the same "frequency domain"; this was before DVFS dynamic frequency/voltage for power management.) L1i/d was tightly integrated into the core like they still are today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.
The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.
How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?
Of mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has non-inclusive L3 cache. Intel since Nehalem has used inclusive last-level cache so its tags can be a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.
On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.
The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
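(Rough numbers for that concurrency limit: with on the order of 10 line-fill buffers of 64 bytes each and ~70-80 ns effective miss latency at the lower clock, demand misses alone can only sustain about 10 × 64 B / 75 ns ≈ 8-9 GB/s from one core; L2 hardware prefetch raises that, but typically not all the way to what dual-channel DRAM can supply.)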
The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.
On Haswell and later Xeons (E5 v3), uncore (the ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, saying Haswell (v4) when Haswell is actually Xeon E[357]-xxxx v3; but other sources like the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.
On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket, by making it slow to keep up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:
On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!
... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%
The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
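A minimal version of such a spinner on Linux might look like this (hypothetical sketch; pass the number of a core on the socket you want to keep clocked up):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin this process to one core and spin forever, keeping that core
 * (and, on pre-Haswell Xeons, that package's uncore) at full clock. */
int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    for (;;) {}   /* busy-wait: does nothing except keep the clock up */
}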
Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.
I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.
Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 about some changes like new snoop mode options (which hop on the ring bus sends snoops to the other core), but it doesn't mention clocks.
A larger cache may have a higher access time, but it can still sustain one access per cycle per port if it is fully pipelined. However, it may also constrain the maximum supported frequency.
In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in a separate frequency domain.
In modern Intel processors (since Nehalem, I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a socket is configured as two NUMA nodes. In this case, I think the uncore partitions of the two NUMA nodes would still be in the same frequency domain.
There is special circuitry used to cross frequency domains, and all cross-domain communication has to pass through it. So yes, I think it incurs a small performance overhead.
There are other frequency domains. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.

How to disable prefetcher in Atom N270 processor

I am trying to disable hardware prefetching in my system with Atom processors (N270).
I am following the method from the link How do I programmatically disable hardware prefetching in core2duo?
I am able to execute:
./rdmsr 0x1a0
366b52488
However, this gives an error message:
./wrmsr -p0 0x1a0 0x366d52688
wrmsr: Cpu 0 can't set MSR from 0x1a0 to 0x366d52688
Although I am able to set bit 0 and bit 3, no other bits are allowed to be modified:
./wrmsr -p0 0x1a0 0x366b52489
As per this link (disable prefetcher in i3/i7), the hardware prefetcher in Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell can be disabled via the MSR at address 0x1a4.
On the Atom processor, reading 0x1a4 is not permitted:
./rdmsr 0x1a4
rdmsr: Cpu 0 can't read MSR from 0x000001a4
I am wondering how it is possible that there is no information available on how to disable the hardware prefetcher in an Atom processor, although the Atom N270 and Core 2 Duo were released in the same year (2008) and Intel has disclosed how to disable the hardware prefetcher in the Core 2 Duo.
Any link to a document on how I can disable the prefetcher in Atom processors would be a great help. Thank you in advance.
The only reliable source for information like this is the Intel Architecture Software Developer's Manual. There is an entire chapter dedicated to MSRs (in the most recent release it's Chapter 35).
The 0x1a4 and 0x1a0 MSR addresses that you found are model dependent (that's why they're called Model Specific Registers), meaning they might only be available on some models, or may be removed in future models.
If you go to the chapter "MSRS IN THE 45 NM AND 32 NM INTEL® ATOM™ PROCESSOR FAMILY", which matches your Atom N270, you won't be able to find any MSR related to prefetcher settings. So in official Intel CPU releases it's not available (though it might be found in some engineering samples).
There might be two reasons why it's not available: either it's not a highly required feature, so removing it could save some silicon gates; or Intel thinks this feature is best left at its default and not made configurable to the public (it is subject to misuse by a vendor or user, leading to poor performance in certain conditions).
BTW, information about the 0x1a4 MSR address can be found in IA SDM Chapter 35.5, and 0x1a0 in Chapter 35.2.

How to disable the Last Level Cache only of an Intel Ivy Bridge CPU?

I know how to disable all three levels of cache on an Intel Ivy Bridge CPU: I only need to set the CD bit of the CR0 register to 1 on all CPUs.
However, I want to disable only the last level of cache (the L3 cache) on an Intel Ivy Bridge or Sandy Bridge CPU and keep using the on-chip L1 and L2 caches.
The reason I want to do this experiment is that I want to test the performance of the L3 cache and see the effect of not using it.
Could anyone give me a pointer or some insight on how to achieve that?
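(For reference, the all-levels disable mentioned in the question looks roughly like the following sketch of a Linux kernel module. It is illustrative only: it sets CR0.CD on every CPU and flushes the caches, which disables the whole cache hierarchy, not just the L3.)

#include <linux/module.h>
#include <linux/smp.h>

MODULE_LICENSE("GPL");

/* Set CR0.CD = 1 (no new cache fills) and flush caches on every CPU.
 * This is the whole-hierarchy disable the question already describes;
 * it does not offer a way to disable only the L3. Experiment code only. */
static void cache_disable(void *unused)
{
    unsigned long cr0;

    asm volatile("mov %%cr0, %0" : "=r"(cr0));
    cr0 |= (1UL << 30);               /* CR0.CD: cache disable */
    cr0 &= ~(1UL << 29);              /* CR0.NW must be 0 when CD = 1 */
    asm volatile("mov %0, %%cr0" : : "r"(cr0) : "memory");
    asm volatile("wbinvd" ::: "memory");   /* write back + invalidate all caches */
}

static int __init cd_init(void)
{
    on_each_cpu(cache_disable, NULL, 1);
    return 0;
}

static void __exit cd_exit(void)
{
    /* Left empty: re-enabling (clearing CD) is omitted from this sketch. */
}

module_init(cd_init);
module_exit(cd_exit);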

How can I read from the pinned (page-locked) RAM, and not from the CPU cache (using DMA zero-copy with the GPU)?

If I use DMA for RAM <-> GPU in CUDA C++, how can I be sure that the memory will be read from the pinned (page-locked) RAM, and not from the CPU cache?
After all, with DMA the CPU does not know anything about the fact that someone changed the memory, or about the need to synchronize the CPU caches with RAM. And as far as I know, a C++11 memory barrier does not help with DMA and will not force a read from RAM; it only ensures consistency between the L1/L2/L3 caches. Furthermore, in general there is no protocol for resolving conflicts between cache and RAM on the CPU; there are only protocols for keeping the different levels of CPU cache (L1/L2/L3) and multiple CPUs in a NUMA system coherent: MOESI / MESIF.
On x86, the CPU does snoop bus traffic, so this is not a concern. On Sandy Bridge class CPUs, the PCI Express bus controller is integrated into the CPU, so the CPU actually can service GPU reads from its L3 cache, or update its cache based on writes by the GPU.
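For completeness, the zero-copy setup the question describes usually looks roughly like this on the host side (illustrative sketch, error checking omitted; the kernel launch itself is only indicated in a comment since it needs nvcc):

#include <cuda_runtime.h>
#include <string.h>

int main(void)
{
    size_t bytes = 1 << 20;
    float *host_buf, *dev_ptr;

    /* Allocate pinned (page-locked) host memory, mapped into the GPU's
     * address space so the GPU can access it directly over PCIe. */
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_ptr, host_buf, 0);

    memset(host_buf, 0, bytes);

    /* kernel<<<grid, block>>>(dev_ptr);
     * The GPU reads/writes host_buf directly; on x86 its PCIe accesses are
     * snooped by (or serviced from) the CPU's caches, so no manual CPU-side
     * cache flush is needed, as explained above. */

    cudaFreeHost(host_buf);
    return 0;
}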

Resources