TLB misses vs cache misses?

Could someone please explain the difference between a TLB (Translation Lookaside Buffer) miss and a cache miss?
I gather that the TLB has something to do with virtual memory addresses, but I'm not clear on what that actually means.
I understand that a block of memory (the size of a cache line) gets loaded into the (L3?) cache, and that if a required address is not held within the current cache lines, this is a cache miss.

Well, all of today's modern operating systems use something called virtual memory, and every address generated by the CPU is virtual. Page tables map these virtual addresses to physical addresses, and the TLB is just a cache of page table entries.
The L1, L2 and L3 caches, on the other hand, cache main memory contents.
A TLB miss occurs when the virtual address => physical address mapping for a CPU-requested virtual address is not in the TLB. The entry must then be fetched from the page table into the TLB.
A cache miss occurs when the CPU requires something that is not in the cache. The data is then looked up in primary memory (RAM); if it is not there either, it must be fetched from secondary memory (hard disk).
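Since the answer above says the TLB is just a cache of page table entries, here is a minimal, illustrative C sketch of that idea: translation consults a tiny direct-mapped TLB first and falls back to the page table only on a miss. All sizes and names here (PAGE_SHIFT, TLB_ENTRIES, the page_table array) are toy assumptions, not any real hardware layout.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT  12    /* 4 KiB pages (assumed)        */
    #define TLB_ENTRIES 16    /* tiny direct-mapped TLB (toy) */
    #define NUM_PAGES   256   /* toy page table size          */

    struct tlb_entry { bool valid; uint64_t vpn, pfn; };

    static struct tlb_entry tlb[TLB_ENTRIES];
    static uint64_t page_table[NUM_PAGES];  /* vpn -> pfn, filled by the "OS" */

    /* Translate a virtual address, reporting whether the TLB hit. */
    static uint64_t translate(uint64_t vaddr, bool *tlb_hit)
    {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        *tlb_hit = e->valid && e->vpn == vpn;
        if (!*tlb_hit) {      /* TLB miss: fetch the entry from the page table */
            e->valid = true;
            e->vpn   = vpn;
            e->pfn   = page_table[vpn];
        }
        return (e->pfn << PAGE_SHIFT) | offset;
    }

    int main(void)
    {
        for (uint64_t v = 0; v < NUM_PAGES; v++)
            page_table[v] = v + 100;                 /* arbitrary toy mapping */

        bool hit;
        uint64_t p = translate(0x3ABC, &hit);        /* first touch: TLB miss */
        printf("paddr=%#llx hit=%d\n", (unsigned long long)p, hit);
        p = translate(0x3DEF, &hit);                 /* same page: TLB hit    */
        printf("paddr=%#llx hit=%d\n", (unsigned long long)p, hit);
        return 0;
    }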

The following sequence, starting from loading the first instruction's address (a virtual address) into the PC, makes the concepts of TLB miss and cache miss very clear; a toy C rendering of the sequence follows below.
The first instruction
• Accessing the first instruction
Take the starting PC
Access the iTLB with the VPN extracted from the PC: iTLB miss
Invoke the iTLB miss handler
Calculate the PTE address
If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there too
Access the page table in main memory: the PTE is invalid: page fault
Invoke the page fault handler
Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart fetch
• Now you have the physical address
Access the I-cache: miss
Send a refill request to the higher levels: you miss everywhere
Send the request to the memory controller (north bridge)
Access main memory
Read the cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset
• This is the longest possible latency in an instruction/data access
Source: https://software.intel.com/en-us/articles/recap-virtual-memory-and-cache
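The same worst-case walk can be written down as a decision tree. Below is a runnable toy in C that just traces the path; every predicate is a hypothetical stand-in for a hardware or OS lookup, wired here to "miss" so that the longest path is exercised.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the hardware/OS lookups in the sequence
     * above, all wired to "miss" so the longest path is exercised. */
    static bool itlb_hit(void)      { return false; }
    static bool pte_in_caches(void) { return false; }
    static bool pte_valid(void)     { return false; }  /* -> page fault */
    static bool icache_hit(void)    { return false; }
    static bool l2_l3_hit(void)     { return false; }

    int main(void)
    {
        puts("extract VPN from PC, look up iTLB");
        if (!itlb_hit()) {
            puts("  iTLB miss: compute PTE address");
            if (!pte_in_caches())
                puts("  PTE not in L1D/L2: read the page table in memory");
            if (!pte_valid()) {
                puts("  PTE invalid: page fault -> allocate frame, read page");
                puts("  from disk, update PTE, load it into iTLB, restart fetch");
            }
        }
        puts("physical address in hand, look up I-cache");
        if (!icache_hit()) {
            puts("  I-cache miss: refill request to L2/L3");
            if (!l2_l3_hit())
                puts("  miss everywhere: memory fetch, refill caches on return");
        }
        puts("extract the instruction from the line with the block offset");
        return 0;
    }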

The HOW of both processes is covered above; on the note of performance, a cache miss does not necessarily stall the CPU. A small number of cache misses can be tolerated using algorithmic prefetching techniques. A TLB miss, however, causes the CPU to stall until the TLB has been updated with the new address. In other words, prefetching can mask a cache miss but not a TLB miss.
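To make the prefetching point concrete, here is a minimal sketch using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 elements is an arbitrary, workload-dependent choice, not a recommendation.

    #include <stddef.h>
    #include <stdio.h>

    #define PREFETCH_DIST 16   /* elements ahead; tuning is workload-specific */

    /* Sum an array while hinting the next miss in advance, so the memory
     * fetch overlaps with useful work.  Note the prefetch itself needs an
     * address translation; on many CPUs a prefetch that misses the TLB is
     * simply dropped, which is one reason prefetching masks cache misses
     * far better than it masks TLB misses. */
    static long sum(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&data[i + PREFETCH_DIST]);  /* hint only */
            total += data[i];
        }
        return total;
    }

    int main(void)
    {
        long data[1024];
        for (size_t i = 0; i < 1024; i++)
            data[i] = (long)i;
        printf("sum = %ld\n", sum(data, 1024));   /* 523776 */
        return 0;
    }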

Related

Will the page table be put in the CPU cache?

According to my understanding, a load/store operation accesses data at a virtual address (vaddr), and this vaddr is translated into a physical address (paddr) so the access can be fulfilled by the memory hierarchy.
The translation first looks in the TLB; if no match is found, a multi-level(?) page table lookup is triggered.
My question is: will the page table be put in the L1D cache, L2 cache or LLC, besides the quite limited TLB entries?

Cache miss, TLB miss and page fault

Can someone clearly explain to me the difference between a cache miss, a TLB miss and a page fault, and how these affect the effective memory access time?
Let me explain all these things step by step.
The CPU generates the logical address, which contains the page number and the page offset.
The page number is used to index into the page table to get the corresponding page frame number, and once we have the page frame of physical memory (also called main memory), we apply the page offset to get the right word of memory.
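In code, that split is just a shift and a mask. A minimal sketch, assuming 4 KiB pages and a made-up frame number standing in for the page table lookup:

    #include <stdio.h>

    #define PAGE_SHIFT 12                         /* 4 KiB pages (assumed) */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    int main(void)
    {
        unsigned logical     = 0x1ABCD;
        unsigned page_number = logical >> PAGE_SHIFT;  /* indexes page table / TLB */
        unsigned page_offset = logical & PAGE_MASK;    /* reused unchanged         */
        unsigned frame       = 0x42;                   /* pretend lookup result    */
        unsigned physical    = (frame << PAGE_SHIFT) | page_offset;

        printf("page=%#x offset=%#x -> phys=%#x\n",
               page_number, page_offset, physical);    /* page=0x1a ... 0x42bcd */
        return 0;
    }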
Why a TLB (Translation Lookaside Buffer)?
The thing is that the page table is stored in physical memory and can sometimes be very large, so to speed up the translation of logical addresses to physical addresses we often use a TLB, which is made of expensive, faster associative memory. Instead of going to the page table first, we index into the TLB with the page number and get the corresponding page frame number; if it is found, we completely avoid the page table (because we have both the page frame number and the page offset) and form the physical address directly.
TLB Miss
If we don't find the page frame number inside the TLB, it's called a TLB miss; only then do we go to the page table to look for the corresponding page frame number.
TLB Hit
If we find the page frame number in the TLB, it's called a TLB hit, and we don't need to go to the page table.
Page Fault
Occurs when the page accessed by a running program is not present in physical memory. It means the page is present in the secondary memory but not yet loaded into a frame of physical memory.
Cache Hit
Cache memory is a small memory that operates faster than physical memory, and we always go to the cache before we go to physical memory. If we are able to locate the corresponding word inside the cache, it's called a cache hit and we don't even need to go to physical memory.
Cache Miss
Only when the cache mapping is unable to find the corresponding block of memory inside the cache (a block is analogous to a physical memory page frame), called a cache miss, do we go to physical memory and do all that work of going through the page table or TLB.
So the flow is basically this (a rough C sketch of the same decision tree follows the list):
1. First go to the cache memory; if it's a cache hit, we are done.
2. If it's a cache miss, go to step 3.
3. Go to the TLB; if it's a TLB hit, go to physical memory using the physical address formed, and we are done.
4. If it's a TLB miss, go to the page table to get the frame number of your page for forming the physical address.
5. If the page is not found there, it's a page fault: if all the frames are occupied, use one of the page replacement algorithms; otherwise just load the required page from secondary memory into a physical memory frame.
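Here is that sketch, runnable as-is; the lookup functions are stand-ins whose return values you can flip to trace other paths. It illustrates the flow above, nothing more.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in lookups; all "miss" here so the whole flow is visible. */
    static bool cache_hit(unsigned vaddr)     { (void)vaddr; return false; }
    static bool tlb_hit(unsigned page)        { (void)page;  return false; }
    static bool page_in_memory(unsigned page) { (void)page;  return false; }

    int main(void)
    {
        unsigned vaddr = 0x1ABC, page = vaddr >> 12;   /* 4 KiB pages assumed */

        if (cache_hit(vaddr))                  /* step 1 */
            puts("cache hit: done");
        else if (tlb_hit(page))                /* steps 2-3 */
            puts("cache miss, TLB hit: access physical memory, done");
        else if (page_in_memory(page))         /* step 4: page table walk */
            puts("TLB miss: frame found in page table, form physical address");
        else                                   /* step 5 */
            puts("page fault: bring the page in from secondary memory "
                 "(page replacement if no frame is free)");
        return 0;
    }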
End Note
The flow I have discussed applies to a virtual cache (VIVT) (faster, but not sharable between processes); the flow would definitely change with a physical cache (PIPT) (slower, but sharable between processes). Caches can be addressed in multiple ways. If you are willing to dive deeply, have a look at this and this.
This diagram might help you see what happens when there is a hit or a miss.
Just imagine a process is running and requires a data item X.
At first the cache memory will be checked to see if it has the requested data item; if it is there (cache hit), it will be returned. If it is not there (cache miss), it will be loaded from main memory.
On a cache miss, main memory will be checked to see if there is a page containing the requested data item (page hit); if no such page is there (page fault), the page containing the desired item has to be brought into main memory from disk.
While processing the page fault, the TLB will be checked to see if the desired page's frame number is available there (TLB hit); otherwise (TLB miss) the OS has to consult the page table to service the fault.
Time required to access these types of memory:
cache << main memory << disk
Cache access requires the least time, so a hit or a miss at a given level drastically changes the effective access time.
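To see how drastically, here is a tiny worked example of effective access time. Every latency and rate below is a made-up round number for illustration only.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative round numbers only. */
        double t_cache = 1.0;       /* ns                                  */
        double t_mem   = 100.0;     /* ns                                  */
        double t_disk  = 10e6;      /* ns, i.e. 10 ms page fault service   */
        double h_cache = 0.95;      /* cache hit rate                      */
        double p_fault = 1e-6;      /* page fault rate on memory accesses  */

        /* hit: just the cache; miss: cache probe + memory access, plus the
         * (rare but enormous) expected page fault penalty. */
        double emat = h_cache * t_cache
                    + (1 - h_cache) * (t_cache + t_mem + p_fault * t_disk);
        printf("effective access time = %.2f ns\n", emat);  /* 6.50 ns */
        return 0;
    }

Even with a one-in-a-million page fault rate, the 10 ms disk service time contributes measurably; that is why a hit or a miss at each level matters so much.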
What causes page faults? Is it always because the memory has been moved to hard disk? Or just moved around for other applications?
Well, it depends. If your system does not support multiprogramming (in a multiprogramming system there are one or more programs loaded in main memory that are ready to execute), then the page fault has definitely occurred because the memory has been moved to the hard disk.
If your system does support multiprogramming, then it depends on whether your operating system uses global or local page replacement. With global replacement, yes, there is a chance the memory has been moved around for other applications; with local replacement, the memory has been moved back to the hard disk. When a process incurs a page fault, a local page replacement algorithm selects for replacement some page that belongs to that same process, whereas a global replacement algorithm is free to select any page from the entire pool of frames. This distinction comes up most when dealing with thrashing.
I am confused about the difference between TLB misses and page faults.
A TLB miss occurs when the page table entry required for translating a virtual address to a physical address is not present in the TLB (Translation Lookaside Buffer). The TLB is like a cache, but it stores page table entries rather than data, so that on a TLB hit we can completely bypass the page table, as you can see in the diagram.
Is a page fault a crash? Or is it the same as a TLB miss?
Neither of them is a crash, as a crash is not recoverable, and it is well known that we can recover from both a page fault and a TLB miss without aborting the process.
The operating system uses virtual memory, and page tables map these virtual addresses to physical addresses. The TLB works as a cache for this mapping.
program >>> TLB >>> cache >>> RAM
A program looks for a page in the TLB; if it doesn't find that page, it's a TLB miss, and it then looks for the page in the cache.
If the page is not in the cache, it's a cache miss, and it then looks for the page in RAM.
If the page is not in RAM, it's a page fault, and the program looks for the data in secondary storage.
So the typical flow would be
Page requested >> TLB miss >> cache miss >> page fault >> look in secondary memory.

Virtual address to physical address translation in the light of cache memory

I understand how a virtual address is translated to a physical address to access main memory. I also understand how the cache memory works.
But my problem is in putting the two concepts together and understanding the big picture of how a process accesses memory and what happens when we have a cache miss. So I have this drawing that will help me ask the following questions:
click to see the image (assume a one-level cache)
1- Does the process access the cache with the exact same physical address that represents the location of the byte in main memory?
2- Is the TLB actually in the first level of cache, or is it a separate memory inside the CPU chip dedicated to the translation purpose?
3- When there is a cache miss, I need to get a whole block and allocate it in the cache, but main memory is organized in frames (pages), not blocks. So is a process's page itself divided into cache blocks that can be brought into the cache in case of a miss?
4- Let's assume there is a TLB miss. Does that mean I need to go all the way to main memory and do the page walk there, or does the page walk happen in the cache?
5- Does a TLB miss guarantee that there will be a cache miss?
6- If you have any reading material that explains the big picture I am trying to understand, I would really appreciate you sharing it with me.
Thanks, and feel free to answer any single question I have asked.
Yes. The cache is not memory that can be addressed separately. Cache mapping translates a physical address into an address within the cache, but this mapping is not something a process usually controls. On some CPU architectures it is completely controlled by the hardware (e.g. Intel x86); on others the operating system would be expected to program the mapping.
The TLB in the diagram you gave is for virtual-to-physical address mapping. It is probably not for the cache. Again, on some architectures the TLBs are programmed by software, whereas on others they are controlled by the hardware.
Page size and cache line size do not have to be the same, as one relates to virtual memory and the other to physical memory. When a process accesses a virtual address, that address is translated to a physical address using the TLB, taking the page size into account. Once that's done, the size of a page is of no further concern: the access is for a byte/word at a physical address. If this causes a cache miss, the cache block that is read will be the block-sized region of physical memory that covers the address being accessed.
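To make "the cache block that covers the physical address" concrete, here is how a physical address is commonly carved into tag / set / offset after translation. The 64-byte line and 64-set geometry is an assumption for the example, not a statement about any particular CPU.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SHIFT 6                    /* 64-byte cache lines (assumed) */
    #define SET_BITS   6                    /* 64 sets (assumed)             */

    int main(void)
    {
        uint64_t paddr  = 0x4B3A7;          /* result of the TLB translation */
        uint64_t offset = paddr & ((1 << LINE_SHIFT) - 1);
        uint64_t set    = (paddr >> LINE_SHIFT) & ((1 << SET_BITS) - 1);
        uint64_t tag    = paddr >> (LINE_SHIFT + SET_BITS);

        /* The page size never appears here: a miss fetches the one 64-byte
         * line containing this physical address, not a whole page. */
        printf("tag=%#llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);   /* tag=0x4b set=14 offset=39 */
        return 0;
    }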
A TLB miss requires a page translation, obtained by reading other memory. This can occur in hardware on some CPUs (such as Intel x86/x64) or may need to be handled in software. Once the page translation has completed, the TLB is reloaded with it.
A TLB miss does not imply a cache miss. A TLB miss just means the virtual-to-physical mapping was not known and a page address translation had to occur. A cache miss means the physical memory content could not be provided quickly.
To recap:
the TLB exists to cache the virtual-to-physical memory mapping so that virtual addresses can be converted to physical addresses quickly. It has nothing to do with physical memory content.
the cache exists to allow faster access to memory. It is only there to provide the content of physical memory faster.
Keep in mind that the term cache is used for lots of purposes (note the use of "cache" when describing the TLB above). A TLB is a bit more specific and usually implies virtual memory translation, though that's not universal. For example, some DMA controllers have a TLB too, but one that converts block addresses rather than virtual addresses to physical addresses.

How are MTRR registers implemented? [closed]

x86/x86-64 exposes MTRRs (Memory Type Range Registers) that can be used to designate different portions of the physical address space for different usages (e.g., cacheable, uncacheable, write-combining, etc.).
My question is: does anybody know how these constraints on the physical address space defined by the MTRRs are enforced in hardware? On each memory access, does the hardware check whether the physical address falls in a given range before deciding whether it should look up the cache, look in the write-combining buffer, or send the access to the memory controller directly?
Thanks
Wikipedia says in the article MTRR that:
Newer (primarily 64-bit) x86 CPUs support a more advanced technique called Page Attribute Tables that allow for per-table setting of these modes, instead of having a limited number of low-granularity registers
So for newer x86/x86_64 CPUs, it is possible to say that MTRRs may be implemented as a technique additional to PAT (Page Attribute Tables). PAT settings are stored in the page table (some bits in each Page Table Entry, or PTE), and in the CPU they are cached in the TLB (part of the MMU). The TLB (and MMU) is already visited on every memory access, so I think it may be a good place to control the memory type, even with MTRRs(?)
But what if I stop guessing and open the RTFM book? There is one very good book about the x86 world: The Unabridged Pentium 4: IA32 Processor Genealogy (ISBN-13: 978-0321246561). Part 7, chapter 24 "Pentium Pro software enhancements", section "MTRRs added".
There are long rules for every MTRR memory type on pages 582-584, but the rules for all 5 types (Uncacheable=UC, Write-Combining=WC, Write-Through=WT, Write-Protect=WP, Write-Back=WB) begin with: "Cache lookups are performed".
And in Part 9 "Pentium III", chapter 32 "Pentium III Xeon", the book clearly says:
When it has to perform a memory access, the processor consults both the MTRRs and the selected PTE or PDE to determine the memory type (and therefore the rules of conduct it is to follow).
But on the other hand... WRMSR into the MTRR registers will invalidate the TLBs (according to the Intel instruction manual "instruct32.chm"):
When the WRMSR instruction is used to write to an MTRR, the TLBs are invalidated, including the global entries (see "Translation Lookaside Buffers (TLBs)" in Chapter 3 of the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 3).
And there is one more direct hint in the "Intel 64 and IA-32 Architectures Software Developer's Manual, vol 3a", section "10.11.9 Large page considerations":
The MTRRs provide memory typing for a limited number of regions that have a 4 KByte granularity (the same granularity as 4-KByte pages). The memory type for a given page is cached in the processor’s TLBs.
You asked:
On each memory access, does the hardware check whether the physical address falls in a given range
No. Each memory access is not compared with all the MTRRs. The MTRR ranges are instead pre-combined with the PTE's memory-type bits when a PTE is loaded into the TLB, so the only place the memory type needs to be checked is the TLB line. And the TLB IS checked on every memory access.
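For what it's worth, the variable-range MTRR match rule itself is simple: per the Intel SDM, a physical address falls in a range when (addr AND PhysMask) equals (PhysBase AND PhysMask). A sketch with made-up base/mask values, ignoring the valid bit and the fact that real MTRRs only cover address bits 12 and up:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Variable-range MTRR match per the Intel SDM: an address is in the
     * range when (addr & mask) == (base & mask).  Values are examples. */
    static bool mtrr_match(uint64_t paddr, uint64_t base, uint64_t mask)
    {
        return (paddr & mask) == (base & mask);
    }

    int main(void)
    {
        uint64_t base = 0xC0000000;     /* e.g. a WC frame-buffer range  */
        uint64_t mask = 0xFF000000;     /* 16 MiB range (example)        */

        printf("0xC0ABCDEF in range: %d\n",
               mtrr_match(0xC0ABCDEF, base, mask));   /* 1 */
        printf("0x80000000 in range: %d\n",
               mtrr_match(0x80000000, base, mask));   /* 0 */
        return 0;
    }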
whether it should look up the cache, look in the write-combining buffer, or send the access to the memory controller directly
No, there is something here that isn't widely understood: the cache is looked up for every access, even for UC (e.g. if a region has just been changed to UC, there can be a cached copy that must be evicted).
From chapter 24 (it is about Pentium 4):
Loads from Cacheable Memory
The types of memory that the processor is permitted to cache from are WP, WT and WB memory (as defined by the MTRRs and the PTE or PDE).
When the core dispatches a load µop, the µop is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is then issued to the L1 Data Cache for fulfillment:
If the cache has a copy of the line that contains the requested read data, the read data is placed in the Load Buffer.
If the cache lookup results in a miss, the request is forwarded upstream to the L2 Cache.
If the L2 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L1 Data Cache.
If the cache lookup results in a miss, the request is forwarded upstream to either the L3 Cache (if there is one) or to the FSB Interface Unit.
If the L3 Cache has a copy of the sector that contains the requested read data, the read data is immediately placed in the Load Buffer and the sector is copied into the L2 Cache and the L1 Data Cache.
If the lookup in the top-level cache results in a miss, the request is forwarded to the FSB Interface Unit.
When the sector is returned from memory, the read data is immediately placed in the Load Buffer and the sector is copied into the L3 Cache (if there is one), the L2 Cache, and the L1 Data Cache.
The processor core is permitted to speculatively execute loads that read data from WC, WP, WT or WB memory space
Loads from Uncacheable Memory
The uncacheable memory types are UC and WC (as defined by the MTRRs and the PTE or PDE).
When the core dispatches a load µop, the read request is placed in the Load Buffer that was reserved for it in the Allocator stage. The memory data read request is submitted to the processor's caches as well. In the event of a cache hit, the cache line is evicted from the cache. The request is issued to the FSB Interface Unit. A Memory Data Read transaction is performed on the FSB to fetch just the requested bytes from memory. When the data is returned from memory, the read data is immediately placed in the Load Buffer.
The processor core is not permitted to speculatively execute loads that read data from UC memory space
Stores to UC Memory
UC is one of the two uncacheable memory types (the other is the WC memory type). When a store to UC memory is executed, it is posted in the Store Buffer reserved for it in the Allocator stage. Stores to UC memory are also submitted to the L1 Data Cache, the L2 Cache, or the L3 Cache (if there is one). In the event of a cache hit, the line is evicted from the cache.
When a Store Buffer containing a store to UC memory is forwarded to the FSB Interface Unit, a Memory Data Write transaction ... is performed on the FSB
Stores to WC Memory
The WC memory type is well-suited to an area of memory (e.g., the video frame buffer) that has the following characteristics:
The processor does not cache from WC memory.
Speculative execution of loads from WC memory is permitted.
Stores to WC memory are deposited in the processor's Write Combining Buffers (WCBs).
Each WCB can hold one line (64 bytes of data).
As stores are performed to a line of WC memory space, the bytes are accumulated in the WCB assigned to record writes to that line of memory space.
A subsequent store to a location in a WCB can overwrite a byte that was deposited in that location by an earlier store to that location. In other words, multiple writes to the same location are collapsed so that the location reflects the last data byte written to that location.
When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed. The device being written to must tolerate this type of behavior (i.e., it must function correctly). See "WCB FSB Transactions" on page 1080 for more information.
Stores to WT Memory
When a store to cacheable, Write-Through memory is executed, the store is posted in the Store Buffer that was reserved for its use in the Allocator stage. In addition, the store is submitted to the L1 Data Cache for a lookup. There are several possibilities:
* If the store hits on the Data Cache, the line in the cache is updated, but it remains in the S state (which means the line is valid).
* If the store misses the Data Cache, it is forwarded to the L2 Cache and a lookup is performed:
* - If it hits on a line in the L2 Cache, the line is updated, but it remains in the S state (which means the line is valid).
* - If it misses on the L2 Cache and there is no L3 Cache, no further action is taken.

What is TLB shootdown?

What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The action of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware)
However, the page table can be quite large, and traversing it (to find the physical address corresponding to a virtual address) can be a time-consuming process. To make this process faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently accessed translations.
As can be seen, the TLB entries need to be in sync with their respective page table entries at all times. TLBs are a per-core cache, i.e. every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry must be invalidated in all of the cores. This process is called a TLB shootdown.
TLB flushes can be triggered by various virtual memory operations that change the page table entries, like page migration, freeing pages, etc.
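A toy model of the shootdown sequence in C: one "core" changes the shared PTE and invalidates every other core's cached copy. The helpers are hypothetical stand-ins; a real kernel does this with inter-processor interrupts and instructions like invlpg, and is far more involved.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NCPUS 4

    /* One shared page table entry and each core's cached (TLB) copy of it. */
    static uint64_t pte = 0xAAA;
    static uint64_t tlb[NCPUS];
    static bool     tlb_valid[NCPUS];

    /* The initiating core changes the mapping, then "sends an IPI" to every
     * other core telling it to invalidate its now-stale TLB entry. */
    static void shootdown(int initiator, uint64_t new_pte)
    {
        pte = new_pte;
        tlb[initiator] = new_pte;        /* initiator refreshes its own TLB  */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (cpu != initiator)
                tlb_valid[cpu] = false;  /* stands in for IPI + invalidation */
    }

    int main(void)
    {
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            tlb[cpu] = pte;
            tlb_valid[cpu] = true;
        }

        shootdown(0, 0xBBB);             /* CPU 0 remaps the page */

        for (int cpu = 0; cpu < NCPUS; cpu++)
            printf("cpu%d: %s\n", cpu,
                   tlb_valid[cpu] ? "TLB entry valid"
                                  : "TLB entry shot down, must re-walk page table");
        return 0;
    }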
