The last level cache replacement policy - caching

I've found a blog article about the Intel Ivy Bridge cache replacement policy. The author concluded that Ivy Bridge's L3 cache replacement policy is no longer pseudo-LRU.
Under the new cache replacement policy, suppose that there are 4 sets in the L3, and sets 0 and 1 are being used by a process. Sets 2 and 3 are available to allocate. If a new process on another CPU tries to load two pages into the L3 cache, is it guaranteed that the new process loads its pages into sets 2 and 3? In other words, if there are available cache sets in the last level cache, does the hardware always choose the available sets for new data?

Modify the write policy of the cache [duplicate]

Every modern high-performance CPU of the x86/x86_64 architecture has some hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data loaded from/to main RAM is cached in some of them.
Sometimes the programmer may want some data to not be cached in some or all cache levels (for example, when wanting to memset 16 GB of RAM while keeping some other data in the cache): there are non-temporal (NT) instructions for this, like MOVNTDQA (https://stackoverflow.com/a/37092, http://lwn.net/Articles/255364/).
But is there a programmatic way (for some AMD or Intel CPU families like P3, P4, Core, Core i*, ...) to completely (but temporarily) turn off some or all levels of the cache, so as to change how every memory access instruction (globally, or for some applications / regions of RAM) uses the memory hierarchy? For example: turn off L1, or turn off both L1 and L2? Or change every memory access type to "uncached" UC (the CD+NW bits of CR0? SDM vol. 3A, pages 423-425, and "Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (Available only in processors based on Intel NetBurst microarchitecture) — Allows the L3 cache to be disabled and enabled, independently of the L1 and L2 caches.").
I think such an action would help protect data from cache side-channel attacks/leaks like stealing AES keys, covert cache channels, and Meltdown/Spectre, although disabling the caches would have an enormous performance cost.
PS: I remember such a program posted many years ago on some technical news website, but can't find it now. It was just a Windows exe to write some magical values into an MSR and make every Windows program running after it very slow. The caches were turned off until reboot or until starting the program with the "undo" option.
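For context on the non-temporal stores mentioned in the question, here is a minimal sketch (my own illustration, not part of the original question) of a streaming memset that writes around the caches using the SSE2 intrinsic _mm_stream_si128; it assumes the destination is 16-byte aligned and the size is a multiple of 16:

#include <emmintrin.h>   // SSE2 intrinsics: _mm_stream_si128, _mm_setzero_si128, _mm_sfence
#include <cstddef>

void nt_memset_zero(void* dst, std::size_t bytes) {
    __m128i zero = _mm_setzero_si128();
    __m128i* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(p + i, zero);   // non-temporal store: does not allocate a cache line
    _mm_sfence();                        // order the streaming stores before later stores
}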
Intel's manual, Volume 3A, Section 11.5.3, provides an algorithm to globally disable the caches:
11.5.3 Preventing Caching
To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:
1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)
2. Flush all caches using the WBINVD instruction.
3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1, “IA32_MTRR_DEF_TYPE MSR”).
The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.
The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRs have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.
That's a long quote, but it boils down to this code:
;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30 ; Set bit CD
and eax, ~(1<<29) ; Clear bit NW
mov cr0, eax
;Step 2 - Invalidate all the caches
wbinvd
;All memory accesses now go to/from main memory, but UC memory ordering may still not be enforced.
;For Atom processors we are done: UC semantics are automatically enforced.
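;Step 3 - Disable all MTRRs (clear the E flag) and set the default memory type to UC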
xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE ;MSR number is 2FFH
wrmsr
;P4 only, remove this code from the L1I
wbinvd
most of which is not executable from user mode.
AMD's manual, Volume 2, provides a similar algorithm in Section 7.6.2:
7.6.2 Cache Control Mechanisms
The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.
Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.
Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:
1. Writes the cache line back if it is in the modified or owned state.
2. Invalidates the cache line.
3. Performs a non-cacheable main-memory access to read or write the data.
If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.
The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.
Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.
[...]
In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.
This translates to this code (very similar to the Intel one):
;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax
;For some models we need to invalidate the L1I
wbinvd
;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType ;MSR number is 2FFH
wrmsr
Caches can also be selectively disabled at:
Page level, with the attribute bits PCD (Page Cache Disable) and PWT (Page Write-Through) [only for the Pentium Pro and Pentium II].
When both are clear the relevant MTRR type is used; if PCD is set, caching of the page is disabled.
Page level, with the PAT (Page Attribute Table) mechanism.
By filling the IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index it's possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB); a sketch of the index computation follows below.
Using the MTRRs (fixed or variable).
By setting the caching type to UC or UC- for specific physical areas.
Of these options only the page attributes can be exposed to user mode programs (see for example this).
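As a small illustration of the PAT indexing mentioned in the list above, here is a sketch assuming the usual 4 KB-page layout (PAT is bit 7 and PCD/PWT are bits 4/3 of the PTE); the helper name is made up:

// Builds the 3-bit index into IA32_PAT from the page-table bits PAT, PCD, PWT.
// Each of the eight IA32_PAT entries holds one memory type (UC, UC-, WC, WT, WP or WB).
unsigned pat_index(bool pat, bool pcd, bool pwt) {
    return (static_cast<unsigned>(pat) << 2) |
           (static_cast<unsigned>(pcd) << 1) |
            static_cast<unsigned>(pwt);
}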

MESI protocol snoop implementation issue

I have a MESI protocol question. Assume that I have two cores (core 1 and core 2) and each core has its own L2 cache. The two cores have the same data and the cache lines are in state S, meaning they both hold the same clean data. At t=0, core 1 writes the cache line, so core 1 will switch to M (Modified) and core 2 will eventually be in the I (Invalid) state. In the physical world it takes time for this transaction to finish; let's say it takes 5 seconds for cache 2 to learn that cache 1 updated the cache line.
Assume that at t=2, core 2 writes the same cache line and switches to the M state. This write from core 2 will be seen by core 1 at t=7 (2+5). Then cache 2 invalidates its line at t=5 (when core 1's write arrives) and cache 1 invalidates its line at t=7 (when core 2's write arrives). Now both lines are invalidated and the data written by core 1 and then core 2 is lost. This obviously does not follow the protocol. What is wrong with my logic, and how is this nonsense prevented?
The two cores have to agree with each other before updating. You can do this via a snoopy or a directory-based protocol. So in your example, the caches cannot simply change their state; rather, they request to change it. Whoever then wins the arbitration gets to change to Modified, while the other is invalidated.
These slides seem to sum it up pretty well. You want to look at slide 20 onward for snoopy protocol as an example.
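To make the arbitration idea concrete, here is a toy sketch (my own, not taken from the slides): a write must first win the shared bus and broadcast an invalidate before its state changes, so two caches can never hold the same line in M at the same time.

#include <cstdio>

enum class State { M, E, S, I };                 // MESI states

struct Line { State state = State::S; int data = 0; };

// Toy "bus": only one write request wins arbitration at a time (fixed priority here).
int arbitrate(int reqA, int reqB) { return reqA < reqB ? reqA : reqB; }

// The winner invalidates the other copy (snooped invalidate) and only then writes.
void write_line(Line& winner, Line& loser, int value) {
    loser.state  = State::I;
    winner.state = State::M;
    winner.data  = value;
}

int main() {
    Line c1, c2;                                 // both caches start in S with the same clean data
    int winner = arbitrate(1, 2);                // core 1 and core 2 both request ownership
    if (winner == 1) write_line(c1, c2, 42);     // core 2 sees the invalidate and must re-request
    else             write_line(c2, c1, 7);
    std::printf("c1 state=%d, c2 state=%d\n", (int)c1.state, (int)c2.state);
}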

Do typical multicore processors have multiple ports from L1 to L2

For typical x86 multicore processors, let us say, we have a processor with 2 cores and both cores encounter an L1 instruction cache miss when reading an instruction. Let's also assume that both of the cores are accessing data at addresses which are in separate cache lines. Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
For typical x86 multicore processors, let us say, we have a processor with 2 cores
OK, let's use some early variant of the Intel Core 2 Duo with two cores (Conroe): it has 2 CPU cores, 2 L1i caches and a shared L2 cache.
and both cores encounter an L1 instruction cache miss when reading an instruction.
OK, there will be a miss in L1i when reading the next instruction (a miss in L1d, when you access data, works in a similar way, but there are only reads from L1i and both reads and writes from L1d). Each L1i that misses will generate a request to the next layer of the memory hierarchy, the L2 cache.
Let's also assume that both of the cores are accessing data at addresses which are in separate cache lines.
Now we need to know how the caches are organized (this is the classic medium-detail cache scheme, which is logically similar to real hardware). A cache is a memory array with special access circuits, and it looks like a 2D array. We have many sets (64 in this picture) and each set has several ways. When we ask the cache for data at some address, the address is split into 3 parts: tag, set index and offset inside the cache line. The set index is used to select the set (a row in our 2D cache memory array), then the tags in all ways of that set are compared (to find the right column in the 2D array) with the tag part of the requested address; this is done in parallel by 8 tag comparators. If a tag in the cache equals the tag part of the requested address, the cache has a "hit" and the cache line from the selected cell is returned to the requester.
(Figure: ways and sets, the 2D array of a cache; image from http://www.cnblogs.com/blockcipher/archive/2013/03/27/2985115.html or http://duartes.org/gustavo/blog/post/intel-cpu-caches/)
(Figure: an example where set index 2 is selected and the parallel tag comparators give a "hit", i.e. tag equality, for Way 1.)
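A rough sketch of that lookup in code (the geometry here, 64 sets by 8 ways with 64-byte lines, is just the example from the figure, not any particular chip):

#include <cstdint>

constexpr int LINE_BYTES = 64, SETS = 64, WAYS = 8;

struct Way { bool valid = false; std::uint64_t tag = 0; /* line data omitted */ };
Way cache[SETS][WAYS];

// Returns the way that hit, or -1 on a miss. Hardware compares all WAYS tags in parallel;
// the loop below is just the software equivalent of those 8 tag comparators.
int lookup(std::uint64_t addr) {
    std::uint64_t line = addr / LINE_BYTES;      // drop the offset-within-line bits
    std::uint64_t set  = line % SETS;            // set index selects the row
    std::uint64_t tag  = line / SETS;            // remaining upper bits form the tag
    for (int w = 0; w < WAYS; ++w)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;                            // hit
    return -1;                                   // miss: the request goes to the next level
}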
What is a "port" to some memory or to a cache? It is the hardware interface between external hardware blocks and the memory. It has lines for the request address (set by the external block; for L1 it is set by the CPU, for L2 by L1), the access type (load or store; may be fixed for the port), data input (for stores) and data output with a ready bit (set by the memory; the cache logic handles misses too, so it returns data both on a hit and on a miss, just later for a miss).
If we want to increase the true port count, we have to add hardware: for a raw SRAM memory array we must add two transistors per bit to increase the port count by 1; for a cache we must duplicate ALL of the tag comparator logic. This cost is too high, so there is not much truly multiported memory in a CPU, and where several ports exist, the total count of true ports is small.
But we can emulate having several ports. From http://web.eecs.umich.edu/~twenisch/470_F07/lectures/15.pdf (EECS 470, 2007), slide 11:
Parallel cache access is harder than parallel FUs
fundamental difference: caches have state, FUs don’t
one port affects future for other ports
Several approaches used
true multi‐porting
multiple cache copies
virtual multi‐porting
multi‐banking (interleaving)
line buffers
Multi-banking (sometimes called slicing) is used by modern chips ("the Intel Core i7 has four banks in L1 and eight banks in L2"; figure 1.6 on page 9 of ISBN 1598297546 (2011) - https://books.google.com/books?id=Uc9cAQAAQBAJ&pg=PA9&lpg=PA9). It means that there are several smaller hardware caches, and some bits of the request address (part of the set index - think of the sets/rows as split over 8 parts, or as colored into interleaved rows) are used to select the bank. Each bank has a low number of ports (1) and functions just like a classic cache (there is a full set of tag comparators in each bank, but the height of a bank - the number of sets in it - is smaller, and every tag in the array is routed to only a single tag comparator - as cheap as in a single-ported cache).
Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
If two accesses are routed to different L2 banks (slices), then the cache behaves as if it were multiported and can handle both requests at the same time. But if both are routed to a single bank with a single port, they will be serialized. Serialization may cost several cycles, with one request stalled at the port; the CPU will see this as slightly higher access latency.
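A sketch of the bank-selection idea in the same spirit (8 single-ported banks selected by low line-address bits; purely illustrative numbers, not any specific chip):

#include <cstdint>

constexpr int LINE_BYTES = 64, BANKS = 8;

int bank_of(std::uint64_t addr) {
    return static_cast<int>((addr / LINE_BYTES) % BANKS);   // low set-index bits pick the bank
}

// Two requests proceed in parallel only if they map to different banks;
// hits to the same single-ported bank are serialized, which shows up as extra latency.
bool served_in_parallel(std::uint64_t a, std::uint64_t b) {
    return bank_of(a) != bank_of(b);
}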

How much data is loaded into the L2 and L3 caches?

If I have this class:
class MyClass{
public:
    short a;
    short b;
    short c;
};
and I have this code performing calculations on the above:
std::vector<MyClass> vec;
// ... (vec is filled elsewhere)
int sum = 0;
for (auto x : vec) {
    sum = x.a * (3 + x.b) / x.c;
}
I understand the CPU only loads the very data it needs from the L1 cache, but when the L1 cache retrieves data from the L2 cache it loads a whole "cache line" (which could include a few bytes of data it doesn't need).
How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?
L2 and L3 caches also have cache lines that are smaller than a virtual memory system page. The size of L2 and L3 cache lines is greater than or equal to the L1 cache line size, not uncommonly being twice that of the L1 cache line size.
For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)
IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)
Using a larger cache line size in last level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing other levels of cache and cache coherence to use smaller chunks which reduces bandwidth use and capacity waste. (Large last level cache blocks also provide a prefetching effect whose cache polluting issues are less severe because of the relatively high capacity of last level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.) With a smaller cache (e.g., typical L1 cache), evictions happen more frequently so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.
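As a back-of-the-envelope illustration of the tag-overhead point, here is a small sketch; the parameters (an 8 MB, 16-way cache with 48-bit physical addresses) are made up for the example:

#include <cstdio>

// Total bits spent on tags for a physically tagged set-associative cache.
long long tag_bits_total(long long cache_bytes, int line_bytes, int ways, int paddr_bits) {
    long long lines = cache_bytes / line_bytes;
    long long sets  = lines / ways;
    int set_bits = 0;    while ((1LL << set_bits) < sets)         ++set_bits;
    int offset_bits = 0; while ((1  << offset_bits) < line_bytes) ++offset_bits;
    int tag_bits = paddr_bits - set_bits - offset_bits;
    return lines * tag_bits;
}

int main() {
    // Doubling the line size halves the number of lines, roughly halving tag storage.
    std::printf("64-byte lines : %lld tag bits\n", tag_bits_total(8LL << 20, 64, 16, 48));
    std::printf("128-byte lines: %lld tag bits\n", tag_bits_total(8LL << 20, 128, 16, 48));
}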
It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 based microarchitectures use 64 byte lines in all levels of the cache hierarchy.
Typically, signed shorts will require two bytes each, meaning that MyClass will need 6 bytes plus any class overhead. If your C++ implementation stores the vector<> contiguously like an array, you should get about 10 MyClass objects per 64-byte line. Provided the vector<> is the right length, you won't load much garbage.
It's wise to note that since you're accessing the elements in a very predictable pattern the hardware prefetcher should kick in and fetch a reasonable amount of data it expects to use in the future. This could potentially bring more than you need into various levels of the cache hierarchy. It will vary from chip to chip.
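If you want to sanity-check the packing assumption on your own platform, here is a quick sketch (sizes are implementation-defined, but on common x86-64 ABIs short is 2 bytes and MyClass has no padding):

#include <cstdio>
#include <vector>

class MyClass {
public:
    short a, b, c;
};

int main() {
    std::printf("sizeof(MyClass)           = %zu\n", sizeof(MyClass));       // typically 6
    std::printf("objects per 64-byte line ~= %zu\n", 64 / sizeof(MyClass));  // typically 10
    std::vector<MyClass> vec(1000);
    std::printf("vector storage is contiguous, so ~%zu cache lines are touched\n",
                vec.size() * sizeof(MyClass) / 64 + 1);
}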

Difference between cache way and cache set

I am trying to learn some stuff about caches. Let's say I have a 4-way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4 cache lines.
How many cache ways do I have? I am not even sure what a cache way is. Can someone explain that? I have done some searching, the best example was
http://download.intel.com/design/intarch/papers/cache6.pdf
But I am still confused.
Thanks.
The cache you are referring to is known as a set associative cache. The whole cache is divided into sets and each set contains 4 cache lines (hence a 4-way cache). So the relationship is:
cache size = number of sets in cache * number of cache lines in each set * cache line size
Your cache size is 32KB, it is 4 way and cache line size is 32B. So the number of sets is
(32KB / (4 * 32B)) = 256
If we think of the main memory as consisting of cache lines, then each memory region of one cache-line size is called a block. So each block of main memory will be mapped to a cache line (but not always to a particular cache line, as this is a set associative cache).
In a set associative cache, each memory block will be mapped to a fixed set in the cache, but it can be stored in any of the cache lines of that set. In your example, each memory block can be stored in any of the 4 cache lines of a set.
Memory block to cache line mapping
Number of blocks in main memory = (1GB / 32B) = 2^25
Number of blocks in each page = (4KB / 32B) = 128
Each byte address in the system can be divided into 3 parts:
Rightmost bits represent byte offset within a cache line or block
Middle bits represent to which cache set this byte(or cache line) will be mapped
Leftmost bits represent tag value
Bits needed to represent 1GB of memory = 30 (1GB = (2^30)B)
Bits needed to represent offset in cache line = 5 (32B = (2^5)B)
Bits needed to represent 256 cache sets = 8 (2^8 = 256)
So that leaves us with (30 - 5 - 8) = 17 bits for the tag. As different memory blocks can be mapped to the same cache set, the tag value helps in differentiating among them.
When an address is generated by the processor, the 8 middle bits of the 30-bit address are used to select the cache set. There will be 4 cache lines in that set, so the tags of all four resident cache lines are checked against the tag of the generated address for a match.
Example
If a 30-bit address is 00000000000000000-00000100-00010 ('-' separated for clarity), then
offset within the cache line is 2
set number is 4
tag is 0
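The same decomposition in code, using the example's geometry (32-byte lines give 5 offset bits, 256 sets give 8 index bits, leaving 17 tag bits of the 30-bit address; the address below is the one from the example):

#include <cstdio>

constexpr unsigned OFFSET_BITS = 5;   // 32-byte line  -> 5 offset bits
constexpr unsigned SET_BITS    = 8;   // 256 sets      -> 8 index bits

int main() {
    unsigned addr   = 0b00000000000000000'00000100'00010;       // tag | set index | offset
    unsigned offset =  addr                  & ((1u << OFFSET_BITS) - 1);
    unsigned set    = (addr >> OFFSET_BITS)  & ((1u << SET_BITS) - 1);
    unsigned tag    =  addr >> (OFFSET_BITS + SET_BITS);
    std::printf("offset=%u set=%u tag=%u\n", offset, set, tag);  // prints offset=2 set=4 tag=0
}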
In their "Computer Organization and Design: The Hardware/Software Interface", Patterson and Hennessy talk about caches; page 408 of one edition shows a figure of a set-associative cache illustrating this.
Apparently, the authors use only the term "block" (and not "line") when they describe set-associative caches. In a direct-mapped cache, the "index" part of the address addresses the line. In a set-associative cache, it indexes the set.
This visualization should fit well with Soumen's explanation in the accepted answer.
However, the book mainly describes Reduced Instruction Set Architectures (RISC). I am personally aware of MIPS and RISC-V versions. So, if you have an x86 in front of you, take this picture with a grain of salt, more as a concept visualization than as actual implementation.
If we divide the memory into cache-line-sized chunks (i.e. 32B chunks of memory), each of these chunks is called a block. Now when you try to access some memory address, the whole memory block (size 32B) containing that address will be placed into a cache line.
No, each set is not responsible for 4096KB or one particular memory page. Multiple memory blocks from different memory pages can be mapped to the same cache set.
