How are MTRR registers implemented? - memory-management

x86/x86-64 exposes MTRRs (Memory Type Range Registers) that can be used to designate different portions of the physical address space for different usages (e.g., cacheable, uncacheable, write-combining, etc.).
My question is: does anybody know how these constraints on the physical address space, as defined by the MTRRs, are enforced in hardware? On each memory access, does the hardware check whether the physical address falls in a given range before the processor decides whether it should look up the cache, look up the write-combining buffer, or send the access to the memory controller directly?
Thanks

Wikipedia says in the article MTRR that:
Newer (primarily 64-bit) x86 CPUs support a more advanced technique called Page Attribute Tables that allow for per-table setting of these modes, instead of having a limited number of low-granularity registers
So, for newer x86/x86_64 CPUs it is possible to say that MTRRs may be implemented as a technique complementary to PAT (Page Attribute Tables). PAT is stored in memory in the page tables (some bits in each Page Table Entry, or PTE), and in the CPU these bits are cached in the TLB (which is part of the MMU). The TLB (and MMU) is already visited on every memory access, so I think it may be a good place to control the type of memory, even with MTRRs(?)
But what if I stop guessing and open the RTFM book? There is one very good book about the x86 world: The Unabridged Pentium 4: IA32 Processor Genealogy (ISBN-13: 978-0321246561). Part 7, chapter 24 "Pentium Pro software enhancement", part "MTRR added".
There are long rules for every MTRR memory type at pages 582-584, but the rules for all 5 types (Uncacheable=UC, Write-Combining=WC, Write-Through=WT, Write-Protect=WP, Write-Back=WB) begin with: "Cache lookups are performed".
And in Part 9 "Pentium III" chapter 32 "Pentium III Xeon" the book clearly says:
When it has to perform a memory access, the processor consults both the MTRRs and the selected PTE or PDE to determine the memory type (and therefore the rules of conduct it is to follow).
But on the other hand... a WRMSR into the MTRR registers will invalidate the TLBs (according to the Intel instruction manual "instruct32.chm"):
When the WRMSR instruction is used to write to an MTRR, the TLBs are invalidated, including the global entries (see "Translation Lookaside Buffers (TLBs)" in Chapter 3 of the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 3).
And there is one more direct hint in the "Intel 64 and IA-32 Architectures Software Developer's Manual, vol 3a", section "10.11.9 Large page considerations":
The MTRRs provide memory typing for a limited number of regions that have a 4 KByte granularity (the same granularity as 4-KByte pages). The memory type for a given page is cached in the processor’s TLBs.
You asked:
On each memory access does the hardware check whether the physical address falls in a given range
No. Not every memory access is compared against all the MTRRs. The MTRR ranges are pre-combined with the PTE bits when the PTE is loaded into the TLB; after that, the only place the memory type needs to be checked is the TLB entry. And the TLB IS checked on every memory access.
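To illustrate the range check itself, here is a minimal C sketch (my own illustration; the base/mask values are made up, not read from real MSRs) of the variable-range match rule from the Intel SDM: an address falls in range n when (addr AND PHYSMASKn) equals (PHYSBASEn AND PHYSMASKn). Conceptually, this is the comparison the processor resolves when it determines the memory type at TLB-fill time:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the variable-range MTRR match rule from the Intel SDM:
     * a physical address falls in range n when
     *   (address & PHYSMASKn) == (PHYSBASEn & PHYSMASKn)
     * The values below are made-up examples, not real MSR reads. */
    typedef struct {
        uint64_t physbase;  /* IA32_MTRR_PHYSBASEn: base | memory type in low 8 bits */
        uint64_t physmask;  /* IA32_MTRR_PHYSMASKn: mask | valid bit (bit 11)        */
    } mtrr_pair;

    static int mtrr_matches(const mtrr_pair *m, uint64_t phys_addr)
    {
        uint64_t mask = m->physmask & ~0xFFFULL;   /* drop the flag bits       */
        if (!(m->physmask & (1ULL << 11)))         /* bit 11 = range enabled   */
            return 0;
        return (phys_addr & mask) == (m->physbase & ~0xFFFULL & mask);
    }

    int main(void)
    {
        /* Hypothetical range: 256 MiB starting at 0xC0000000, type WC (0x01). */
        mtrr_pair m = { 0xC0000000ULL | 0x01,
                        0xFFFFFFFFF0000000ULL | (1ULL << 11) };
        printf("0xC1234567 matches: %d\n", mtrr_matches(&m, 0xC1234567ULL)); /* 1 */
        printf("0x80000000 matches: %d\n", mtrr_matches(&m, 0x80000000ULL)); /* 0 */
        return 0;
    }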
whether it should look up the cache or lookup the writecombining buffer or send it to memory controller directly
No, there is a subtlety here. The cache is looked up for every access, even for UC (e.g., if a region has just been changed to UC, there can be a cached copy which must be evicted).
From chapter 24 (it is about Pentium 4):
Loads from Cacheable Memory
The types of memory that the processor is permitted to cache from are WP, WT and WB memory (as defined by the MTRRs and the PTE or PDE).
When the core dispatches a load µop, the µop is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is then issued to the L1 Data Cache for fulfillment:
If the cache has a copy of the line that contains the requested read data, the read data is placed in the Load Buffer.
If the cache lookup results in a miss, the request is forwarded upstream to the L2 Cache.
If the L2 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L1 Data Cache.
If the cache lookup results in a miss, the request is forwarded upstream to either the L3 Cache (if there is one) or to the FSB Interface Unit.
If the L3 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L2 Cache and the L1 Data Cache.
If the lookup in the top-level cache results in a miss, the request is forwarded to the FSB Interface Unit.
When the sector is returned from memory, the read data is immediately placed in the Load Buffer and the sector is copied into the L3 Cache (if there is one), the L2 Cache, and the L1 Data Cache.
The processor core is permitted to speculatively execute loads that read data from WC, WP, WT or WB memory space
Loads from Uncacheable Memory
The uncacheable memory types are UC and WC (as defined by the MTRRs and the PTE or PDE).
When the core dispatches a load µop, the read request is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is submitted to the processor's caches as well. In the event of a cache hit, the cache line is evicted from the cache. The request is issued to the FSB Interface Unit. A Memory Data Read transaction is performed on the FSB to fetch just the requested bytes from memory. When the data is returned from memory, the read data is immediately placed in the Load Buffer.
The processor core is not permitted to speculatively execute loads that read data from UC memory space
Stores to UC Memory
UC is one of the two uncacheable memory types (the other is the WC memory type). When a store to UC memory is executed, it is posted in the Store Buffer reserved for it in the Allocator stage. Stores to UC memory are also submitted to the L1 Data Cache, the L2 Cache, or the L3 Cache (if there is one). In the event of a cache hit, the line is evicted from the cache.
When a Store Buffer containing a store to UC memory is forwarded to the FSB Interface Unit, a Memory Data Write transaction ... is performed on the FSB
Stores to WC Memory
The WC memory type is well-suited to an area of memory (e.g., the video frame buffer) that has the following characteristics:
The processor does not cache from WC memory.
Speculative execution of loads from WC memory is permitted.
Stores to WC memory are deposited in the processor's Write Combining Buffers (WCBs).
Each WCB can hold one line (64 bytes of data).
As stores are performed to a line of WC memory space, the bytes are accumulated in the WCB assigned to record writes to that line of memory space.
A subsequent store to a location in a WCB can overwrite a byte that was deposited in that location by an earlier store to that location. In other words, multiple writes to the same location are collapsed so that the location reflects the last data byte written to that location.
When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed. The device being written to must tolerate this type of behavior (i.e., it must function correctly). See "WCB FSB Transactions" on page 1080 for more information.
Stores to WT Memory
When a store to cacheable Write-Through memory is executed, the store is posted in the Store Buffer that was reserved for its use in the Allocator stage. In addition, the store is submitted to the L1 Data Cache for a lookup. There are several possibilities:
* If the store hits in the Data Cache, the line in the cache is updated, but it remains in the S state (which means the line is valid).
* If the store misses the Data Cache, it is forwarded to the L2 Cache and a lookup is performed:
  - If it hits on a line in the L2 Cache, the line is updated, but it remains in the S state (which means the line is valid).
  - If it misses in the L2 Cache and there is no L3 Cache, no further action is taken.
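The accumulate-and-collapse behavior of the WCBs described above is easy to model in software. A toy C sketch (my own illustration, not the actual hardware logic): writes to a 64-byte line accumulate in the buffer, and a later store to an already-written byte simply collapses onto it.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define WCB_LINE 64  /* one write-combining buffer holds one 64-byte line */

    /* Toy model of a single WCB: not the real hardware logic, just the
     * accumulate-and-collapse behavior described in the text. */
    typedef struct {
        uint64_t line_addr;            /* 64-byte-aligned line address        */
        uint8_t  data[WCB_LINE];
        uint8_t  written[WCB_LINE];    /* which bytes have been stored so far */
    } wcb;

    static void wcb_store(wcb *b, uint64_t addr, uint8_t byte)
    {
        unsigned off = addr & (WCB_LINE - 1);
        b->data[off] = byte;           /* a later store to the same offset       */
        b->written[off] = 1;           /* overwrites (collapses) the earlier one */
    }

    int main(void)
    {
        wcb b;
        memset(&b, 0, sizeof b);
        b.line_addr = 0x1000;
        wcb_store(&b, 0x1000, 0xAA);
        wcb_store(&b, 0x1001, 0xBB);
        wcb_store(&b, 0x1000, 0xCC);   /* collapses onto the first store */
        printf("byte 0 = 0x%02X (last write wins)\n", b.data[0]);  /* 0xCC */
        return 0;
    }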

Related

How are caches connected to cores?

I have a very fundamental question on how caches (e.g. L1, L2) are physically (in RTL) connected to cores (e.g. an Arm Cortex A53). How many read/write ports/buses are there, and what is their width? Is it a 32-bit bus? How do I calculate the theoretical max bandwidth/throughput of an L1 cache connected to an Arm Cortex A53 running at 1400MHz?
On the web lots of information is available on how caches work, but I couldn't find how they are connected.
You can get the information in the ARM documentation (which is pretty complete compared to others):
L1 data cache:
(configurable) sizes of 8KB, 16KB, 32KB, or 64KB.
Data side cache line length of 64 bytes.
256-bit write interface to the L2 memory system.
128-bit read interface to the L2 memory system.
64-bit read path from the data L1 memory system to the datapath.
128-bit write path from the datapath to the L1 memory system.
Note that there is a single datapath: the documentation mentions it explicitly whenever there are multiple of them, hence there is certainly one port, unless two ports share the same datapath, which would be surprising.
L2 cache:
All bus interfaces are 128-bits wide.
Configurable L2 cache size of 128KB, 256KB, 512KB, 1MB and 2MB.
Fixed line length of 64 bytes.
General information:
One to four cores, each with an L1 memory system and a single shared L2 cache.
In-order pipeline with symmetric dual-issue of most instructions.
Harvard Level 1 (L1) memory system with a Memory Management Unit (MMU).
Level 2 (L2) memory system providing cluster memory coherency, optionally including an L2 cache.
The Level 1 (L1) data cache controller, that generates the control signals for the associated embedded tag, data, and dirty RAMs, and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed, Physically Tagged (PIPT) scheme for lookup that enables unambiguous address management in the system.
The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. The STB can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write out the data on the external write channel. External data writes are through the SCU.
The STB can merge several store transactions into a single transaction if they are to the same 128-bit aligned address.
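That 128-bit merge rule reduces to a simple address comparison: two stores can merge if they target the same 16-byte-aligned region. A hedged one-liner in C (illustrative, not ARM's actual logic):

    #include <stdint.h>

    /* STB merge rule from the text: two stores may merge if they fall in
     * the same 128-bit (16-byte) aligned region. Illustrative only. */
    static inline int stb_can_merge(uint64_t addr_a, uint64_t addr_b)
    {
        return (addr_a & ~0xFULL) == (addr_b & ~0xFULL);
    }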
An upper bound for the L1 bandwidth is frequency * interface_width * number_of_paths, so 1400 MHz * 64 bit * 1 = 10.43 GiB/s from the L1 (reads), and with the 128-bit write path, 20.86 GiB/s to the L1 (writes). In practice, concurrency can be a problem, but it is hard to know which part of the chip will be the limiting factor.
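For reference, the same arithmetic as a small C program (the path widths are the ones quoted from the documentation above):

    #include <stdio.h>

    int main(void)
    {
        double freq_hz    = 1400e6;  /* Cortex-A53 clock from the question     */
        double read_bits  = 64.0;    /* L1 -> datapath read width (per docs)   */
        double write_bits = 128.0;   /* datapath -> L1 write width (per docs)  */
        double gib = 1024.0 * 1024.0 * 1024.0;

        printf("L1 read  upper bound: %.2f GiB/s\n", freq_hz * read_bits  / 8 / gib);
        printf("L1 write upper bound: %.2f GiB/s\n", freq_hz * write_bits / 8 / gib);
        return 0;  /* prints 10.43 and 20.86 */
    }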
Note that there are many other documents available, but this one is the most interesting. I am not sure you can get the physical RTL-level information about the cache, since I expect this information to be confidential, hence not publicly available (I guess competitors could benefit from it).

For a write-back cache policy, why must data first be read from memory before writing to the cache?

Write-back caches perform write operations to the cache memory and return immediately. This happens only when the data is already present in the cache. If the data is not present in the cache, it is first fetched from the lower memories and then written into the cache.
I do not understand why it is important to first fetch the data from memory before writing it. If the data is going to be written, it will become invalid anyway.
I know the basic concept, but I want to know the reason behind having to read data before writing to the address.
I have the following guess,
This is done for Cache Coherency, in a multi-processor environment. Other processors snoop on the bus to maintain Cache Coherency. The processor writing on the address needs to gain an exclusive access, and other processors must find out about this.
But does that mean this is not required on single-processor computers?
Short answer
A write that misses in the cache may or may not fetch the block being written, depending on the write-miss policy of the cache (fetch-on-write-miss vs. no-fetch-on-write-miss).
It does not depend on the write-hit policy (write-back vs. write-through).
Explanation
In order to simplify, let us assume that we have a one-level cache hierarchy:
-----     ------     -------------
|CPU| <-> | L1 | <-> |main memory|
-----     ------     -------------
The L1 write-miss policy is fetch-on-write-miss.
The cache stores blocks of data. A typical L1 block is 32 bytes wide, that is, it contains several words (for instance, 8 x 4-byte words).
The transfer unit between the cache and main memory is a block, but transfers between CPU and cache can be of different sizes (1, 2, 4 or 8 bytes).
Let us assume that the CPU performs a 4-byte word write.
If the block containing the word is not stored in the cache, we have a cache miss. The whole block (32 bytes) is transferred from main memory to the cache, and then the corresponding word (4 bytes) is stored in the cache.
A write-back cache would tag the block as dirty (not invalid, as you stated).
A write-through cache would send the updated word to main memory.
If the block containing the word is stored in the cache, we have a cache hit. The corresponding word is updated.
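Here is a minimal C sketch of that fetch-on-write-miss sequence, using a toy one-block write-back cache (purely illustrative; the 32-byte block size is the one assumed above):

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK 32                     /* 32-byte cache block, as above */

    /* Toy one-block write-back cache illustrating fetch-on-write-miss. */
    typedef struct {
        int      valid, dirty;
        uint64_t tag;                    /* block-aligned address */
        uint8_t  data[BLOCK];
    } line;

    static uint8_t memory[4096];         /* stand-in for main memory */

    static void write_word(line *c, uint64_t addr, uint32_t word)
    {
        uint64_t tag = addr & ~(uint64_t)(BLOCK - 1);
        if (!c->valid || c->tag != tag) {            /* write miss            */
            if (c->valid && c->dirty)                /* evict: write back     */
                memcpy(&memory[c->tag], c->data, BLOCK);
            memcpy(c->data, &memory[tag], BLOCK);    /* fetch the whole block */
            c->tag = tag; c->valid = 1; c->dirty = 0;
        }
        memcpy(&c->data[addr & (BLOCK - 1)], &word, 4); /* update 4-byte word */
        c->dirty = 1;                                /* write-back: mark dirty */
    }

    int main(void)
    {
        line c = {0};
        write_word(&c, 0x40, 0xDEADBEEF);   /* miss: block fetched first  */
        write_word(&c, 0x44, 0x12345678);   /* hit: word updated in place */
        printf("dirty=%d tag=0x%llx\n", c.dirty, (unsigned long long)c.tag);
        return 0;
    }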
More information:
Cache Write Policies and Performance. Norman P. Jouppi.
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf
Your guess is almost correct. However, this behavior is also needed in multi-core single-processor systems.
Your processor can have multiple cores, so when writing a cache line (in a WB cache), the core that issues the write needs to gain exclusive access to that line. If the line intended for the write is marked as dirty, it will be "flushed" to the lower memories before being overwritten with the new information.
In a multi-core CPU, each core has its own L1 cache, and there is the possibility that each core could store a copy of a shared L2 line. Therefore you need this behavior for cache coherency.
You can find out more by reading about the MESI protocol and its derivatives.

Does a cache line flush access the TLB?

Assuming that we have intentionally thrashed the DTLB, and would like to proceed to flush a specific cache line from L1-3 using clflush on a memory region which is (most likely) disjoint from the addresses pointed to by the TLB entries; would this in fact bring the page base address of the cache line we are flushing back into the TLB?
In short, does a clflush touch the TLB at all? I'm assuming that, since this instruction honours coherency, it will subsequently write that line back to memory (obviously needing a TLB look-up).
From the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-L: "Invalidates the cache line that contains the linear address specified with the source operand from all levels of the processor cache hierarchy (data and instruction)."
Since it uses the linear (virtual) address, the address needs to be translated, which means that a page table walk would be needed on a TLB miss. (This would generally be the case even for a different kind of instruction that pushed cache entries out to higher levels of cache since L1 caches are typically physically tagged for x86. In general, tagging with the virtual address has fallen out of favor. Using the physical address for tags means that the physical address is needed to check the cache for a hit, so even if it was not sent to memory, translation would be needed.)
While it would be possible to avoid loading the TLB for such accesses, the extra complexity of such special-case handling would almost certainly not be viewed as worth the bother given that CLFLUSH is not commonly used.
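For completeness, CLFLUSH is available from C via the _mm_clflush intrinsic (SSE2, declared in emmintrin.h). A minimal usage sketch, with fences added as a conservative ordering choice:

    #include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
    #include <stdio.h>

    int main(void)
    {
        static char buf[64] __attribute__((aligned(64)));
        buf[0] = 42;          /* dirty the line in the cache       */
        _mm_mfence();         /* order the store before the flush  */
        _mm_clflush(buf);     /* takes the *linear* address of the
                                 line, so translation (TLB hit or
                                 page walk) happens just as for an
                                 ordinary load or store            */
        _mm_mfence();         /* order the flush before later ops  */
        printf("line flushed\n");
        return 0;
    }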

TLB misses vs cache misses?

Could someone please explain the difference between a TLB (Translation Lookaside Buffer) miss and a cache miss?
I believe I found out that the TLB has something to do with virtual memory addresses, but I wasn't overly clear on what this actually means.
I understand that blocks of memory (the size of a cache line) get loaded into the (L3?) cache, and that if a required address is not held within the current cache lines, this is a cache miss.
Well, all of today's modern operating systems use something called virtual memory. Every address generated by the CPU is virtual. There are page tables that map such virtual addresses to physical addresses. And a TLB is just a cache of page table entries.
On the other hand L1, L2, L3 caches cache main memory contents.
A TLB miss occurs when the mapping of a virtual memory address to a physical memory address for a CPU-requested virtual address is not in the TLB. Then that entry must be fetched from the page table into the TLB.
A cache miss occurs when the CPU requires something that is not in the cache. The data is then looked for in the primary memory (RAM). If it is not there, data must be fetched from secondary memory (hard disk).
The following sequence, after loading the first instruction's address (i.e. a virtual address) into the PC, makes the concepts of TLB miss and cache miss very clear.
The first instruction
• Accessing the first instruction
Take the starting PC
Access the iTLB with the VPN extracted from the PC: iTLB miss
Invoke the iTLB miss handler
Calculate the PTE address
If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
Access the page table in main memory: the PTE is invalid: page fault
Invoke the page fault handler
Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart the fetch
• Now you have the physical address
Access the I-cache: miss
Send a refill request to the higher levels: you miss everywhere
Send the request to the memory controller (north bridge)
Access main memory
Read the cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset
• This is the longest possible latency in an instruction/data access
Source: https://software.intel.com/en-us/articles/recap-virtual-memory-and-cache
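The two-step lookup in that sequence (TLB first, page-table walk on a miss) can be sketched in C with a toy direct-mapped TLB in front of a flat page table. The sizes and the identity-plus-one mapping are arbitrary illustrations:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12
    #define TLB_SIZE  16
    #define NPAGES    256

    /* Toy direct-mapped TLB in front of a flat page table: a TLB miss
     * falls back to the page-table walk, exactly the two-step lookup
     * described in the sequence above. Purely illustrative. */
    static struct { int valid; uint64_t vpn, pfn; } tlb[TLB_SIZE];
    static uint64_t page_table[NPAGES];   /* VPN -> PFN */

    static uint64_t translate(uint64_t vaddr, int *tlb_miss)
    {
        uint64_t vpn = vaddr >> PAGE_BITS;
        unsigned  i  = vpn % TLB_SIZE;
        if (tlb[i].valid && tlb[i].vpn == vpn) {   /* TLB hit  */
            *tlb_miss = 0;
        } else {                                   /* TLB miss */
            *tlb_miss = 1;
            tlb[i].valid = 1;                      /* walk the page table */
            tlb[i].vpn = vpn;
            tlb[i].pfn = page_table[vpn % NPAGES];
        }
        return (tlb[i].pfn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1));
    }

    int main(void)
    {
        for (uint64_t v = 0; v < NPAGES; v++) page_table[v] = v + 1;
        int miss;
        uint64_t p = translate(0x3456, &miss);
        printf("0x3456 -> 0x%llx, tlb_miss=%d\n", (unsigned long long)p, miss);
        p = translate(0x3458, &miss);   /* same page: now a TLB hit */
        printf("0x3458 -> 0x%llx, tlb_miss=%d\n", (unsigned long long)p, miss);
        return 0;
    }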
The HOW of both processes is covered above; on the note of performance, a cache miss does not necessarily stall the CPU. A small number of cache misses can be tolerated using algorithmic prefetching techniques. A TLB miss, however, causes the CPU to stall until the TLB has been updated with the new address. In other words, prefetching can mask a cache miss but not a TLB miss, as the sketch below illustrates.
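To make the prefetching point concrete, here is a hedged sketch using GCC/Clang's __builtin_prefetch; the prefetch distance of 8 elements is an arbitrary example value:

    #include <stddef.h>

    /* Software prefetch can hide a data-cache miss by requesting the line
     * before it is needed; per the answer above, it cannot hide a TLB miss
     * the same way, since the core stalls on translation. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)   /* fetch 8 elements ahead (arbitrary distance) */
                __builtin_prefetch(&a[i + 8], 0 /* read */, 3 /* high locality */);
            s += a[i];
        }
        return s;
    }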

What is TLB shootdown?

What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The actions of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware)
However, the page table can be quite large, and traversing it (to find the physical address corresponding to a virtual address) can be a time-consuming process. To make this process faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently used virtual memory translations.
As can be seen, the TLB entries need to be kept in sync with their respective page table entries at all times. Now, TLBs are a per-core cache, i.e. every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all of the cores. This process is called TLB shootdown.
TLB flushing can be triggered by various virtual memory operations that change the page table entries, such as page migration, freeing pages, etc.
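As a hedged illustration on Linux: any permission-reducing page-table change in a multi-threaded process forces the kernel to shoot down stale TLB entries on the other cores running that process; on x86 Linux the counts show up in the "TLB" row of /proc/interrupts. A minimal sketch (error handling omitted; compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>
    #include <stdio.h>

    /* The permission-reducing mprotect() below edits a live PTE; because
     * another thread of this process keeps the translation hot on another
     * core, the kernel must IPI that core to invalidate the stale TLB
     * entry -- a TLB shootdown. Watch the "TLB" row of /proc/interrupts
     * on x86 Linux while this runs. */
    static char *page;

    static void *spin(void *arg)
    {
        (void)arg;
        for (;;) (void)*(volatile char *)page;  /* keep the PTE hot */
    }

    int main(void)
    {
        page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 1;                            /* populate the mapping */
        pthread_t t;
        pthread_create(&t, NULL, spin, NULL);
        for (int i = 0; i < 100000; i++) {
            mprotect(page, 4096, PROT_READ);               /* PTE change    */
            mprotect(page, 4096, PROT_READ | PROT_WRITE);  /* -> shootdowns */
        }
        printf("done; check /proc/interrupts (TLB row)\n");
        return 0;
    }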
