Architected TLB vs. Architected Page Table - memory-management

What is the difference between architected TLB and architected page table?

A TLB is a hardware structure not unlike a cache or a register file. It resides inside the processor. A page table is a structure in main memory. Wikipedia calls architected TLBs "software-managed TLBs" and an architected page table a "hardware-managed TLB".
Which of the two is architected matters only for the implementation of virtual memory. With an architected TLB, the operating system has to manipulate the TLB directly. Because the capacity of the TLB is limited, the operating system will likely keep an internal structure resembling a page table for each process. One downside of an architected TLB is the high cost of bringing in a new entry in software. Another is that the number of TLB entries is fixed across different processor generations. An example of this approach is MIPS.
A processor with an architected page table will likely have a TLB too, but it is transparent to software, which sees only the page table. This makes TLB refills cheaper and allows a different TLB (e.g. bigger or multi-level) to be used in each processor generation. The downside is additional complexity: the processor has to detect updates of the page table transparently and needs logic to perform the page table walks. An example of this approach is x86.
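To make the contrast concrete, here is a minimal Python sketch of the two refill paths. It is a toy model, not how any real processor or kernel is written: the `TLB` class, the 4 KiB page size, the flat dictionary standing in for the page table, and the `stats` counters are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class TLB:
    """Tiny fully associative TLB with LRU eviction (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def insert(self, vpn, pfn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[vpn] = pfn


def translate_software_managed(tlb, os_table, vaddr, stats):
    """Architected TLB: a miss traps to the OS, which refills the TLB itself."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pfn = tlb.lookup(vpn)
    if pfn is None:
        stats["os_traps"] += 1   # the software refill handler runs
        pfn = os_table[vpn]      # OS-private structure, format is up to the OS
        tlb.insert(vpn, pfn)     # the OS writes the TLB entry explicitly
    return pfn * PAGE_SIZE + offset


def translate_hardware_managed(tlb, page_table, vaddr, stats):
    """Architected page table: hardware walks the table, software never sees the TLB."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pfn = tlb.lookup(vpn)
    if pfn is None:
        stats["hw_walks"] += 1   # page-walk logic runs, no trap is taken
        pfn = page_table[vpn]
        tlb.insert(vpn, pfn)
    return pfn * PAGE_SIZE + offset


table = {0: 7, 1: 3, 2: 9}  # hypothetical mappings
stats = {"os_traps": 0, "hw_walks": 0}
sw_tlb, hw_tlb = TLB(capacity=2), TLB(capacity=2)
addresses = [0x0010, 0x1010, 0x0020, 0x2000, 0x1004]
sw_results = [translate_software_managed(sw_tlb, table, a, stats) for a in addresses]
hw_results = [translate_hardware_managed(hw_tlb, table, a, stats) for a in addresses]
```

Both paths produce the same physical addresses and miss equally often. What differs is who services the miss: a trap into OS code in the first case, dedicated walk logic in the second. That is where the cost difference described above comes from.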

Related

Will page table be put in CPU cache?

According to my understanding, a load/store operation accesses data at a virtual address (vaddr), and this vaddr is translated into a physical address (paddr) so that the memory hierarchy can fulfil the access.
The translation first looks in the TLB; if no match is found, a multi-level(?) page table lookup is triggered.
My question is: will the page table be put in the L1D cache, L2 cache or LLC, in addition to the quite limited TLB entries?

Is the TLB shared between multiple cores?

I've heard that the TLB is maintained by the MMU, not the CPU cache.
Then does one TLB exist on the CPU, shared between all processors, or does each processor have its own TLB?
Could anyone please explain the relationship between the MMU and the L1 and L2 caches?
The TLB caches the translations listed in the page table. Each CPU core can be running in a different context, with different page tables. This is what you'd call the MMU, if it was a separate "unit", so each core has its own MMU. Any shared caches are always physically-indexed / physically tagged, so they cache based on post-MMU physical address.
The TLB is a cache (of PTEs), so technically it's just an implementation detail that could vary by microarchitecture (between different implementations of the x86 architecture).
In practice, all that really varies is the size. 2-level TLBs are common now, to keep full TLB misses to a minimum but still be fast enough to allow 3 translations per clock cycle.
It's much faster to just re-walk the page tables (which can be hot in local L1 data or L2 cache) to rebuild a TLB entry than to try to share TLB entries across cores. This is what sets the lower bound on what extremes are worth going to in avoiding TLB misses, unlike with data caches which are the last line of defence before you have to go off-core to shared L3 cache, or off-chip to DRAM on an L3 miss.
For example, Skylake added a 2nd page-walk unit (to each core). Good page-walking is essential for workloads where cores can't usefully share TLB entries (threads from different processes, or not touching many shared virtual pages).
A shared TLB would mean that invlpg to invalidate cached translations when you do change a page table would always have to go off-core. (Although in practice an OS needs to make sure other cores running other threads of a multi-threaded process have their private TLB entries "shot down" during something like munmap, using software methods for inter-core communication like an IPI (inter-processor interrupt).)
But with private TLBs, a context switch to a new process can just set a new CR3 (top-level page-directory pointer) and invalidate this core's whole TLB without having to bother other cores or track anything globally.
There is a PCID (process-context identifier) feature that lets TLB entries be tagged with an ID (a 12-bit field in CR3 on x86-64, although an OS may use only a handful of values), so entries from different processes' page tables can stay hot in the TLB instead of needing to be flushed on context switch. For a shared TLB you'd need to beef this up.
Another complication is that TLB entries need to track "dirty" and "accessed" bits in the PTE. They're not just a read-only cache of PTEs.
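A rough Python sketch of why tagging helps on a context switch. The `TaggedTLB` class, the `context_switch` helper and the context IDs 1 and 2 are hypothetical names chosen for the example. Real hardware does the tag comparison in the TLB lookup itself.

```python
class TaggedTLB:
    """TLB whose entries are keyed by (context ID, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, ctx_id, vpn, pfn):
        self.entries[(ctx_id, vpn)] = pfn

    def lookup(self, ctx_id, vpn):
        return self.entries.get((ctx_id, vpn))  # None means a TLB miss

    def flush_all(self):
        self.entries.clear()


def context_switch(tlb, new_ctx, use_tags):
    """Return the context ID to use for lookups after the switch."""
    if not use_tags:
        tlb.flush_all()  # untagged: old translations must not survive the switch
        return 0         # every process shares the single implicit tag
    return new_ctx       # tagged: entries of other contexts simply stop matching


def run(use_tags):
    """Process A (ID 1) maps page 5, process B (ID 2) runs, then A resumes."""
    tlb = TaggedTLB()
    ctx = context_switch(tlb, 1, use_tags)
    tlb.insert(ctx, 5, 100)   # A: vpn 5 -> frame 100
    ctx = context_switch(tlb, 2, use_tags)
    tlb.insert(ctx, 5, 200)   # B: same vpn, different frame
    ctx = context_switch(tlb, 1, use_tags)
    return tlb.lookup(ctx, 5)  # does A's translation survive?


untagged_result = run(use_tags=False)
tagged_result = run(use_tags=True)
```

Without tags the final lookup misses, because the TLB was flushed twice. With tags it still returns process A's frame, and B's mapping of the same virtual page does not collide with it.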
For an example of how the pieces fit together in a real CPU, see David Kanter's writeup of Intel's Sandybridge design. Note that the diagrams are for a single SnB core. The only shared-between-cores cache in most CPUs is the last-level data cache.
Intel's SnB-family designs all use a 2MiB-per-core modular L3 cache on a ring bus. So adding more cores adds more L3 to the total pool, along with the cores themselves (each with its own L2/L1D/L1I/uop-cache and two-level TLB).

What is the difference between demand paging and page replacement?

From what I understand, demand paging is basically paging with swapping, so you can swap in a page when it is needed. But page replacement seems like more or less the same thing: you bring in a page that is needed and swap it with an existing page in physical memory.
So is there a distinct difference?
In a system that uses demand paging, the operating system copies a disk page into physical memory only if an attempt is made to access it and that page is not already in memory (i.e., if a page fault occurs). It follows that a process begins execution with none of its pages in physical memory, and many page faults will occur until most of a process's working set of pages is located in physical memory. This is an example of a lazy loading technique.
From Wikipedia's Demand paging:
Demand paging follows that pages should only be brought into memory if
the executing process demands them. This is often referred to as lazy
evaluation as only those pages demanded by the process are swapped
from secondary storage to main memory. Contrast this to pure swapping,
where all memory for a process is swapped from secondary storage to
main memory during the process startup.
Page replacement, by contrast, is what is done on a page fault when a resident page has to make way for the incoming one. It is used with both pure swapping and demand paging.
Page replacement simply means evicting a page from memory to disk to make room for the page being brought in.
Demand paging is a concept in which only the required pages are brought into memory. If a required page is not in memory, the system looks for a free frame. If there are no free frames, a page replacement is done to bring the required page from disk into memory.
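The relationship can be sketched in a few lines of Python. This is a toy simulation under stated assumptions: FIFO is used as the replacement policy purely because it is short (the answers above do not name a policy), and `run_demand_paging` is a hypothetical helper.

```python
from collections import deque


def run_demand_paging(reference_string, num_frames):
    """Count page faults and evictions for a sequence of page accesses."""
    frames = deque()  # resident pages, oldest first (FIFO order)
    resident = set()
    faults = 0
    evictions = 0
    for page in reference_string:
        if page in resident:
            continue  # page already in memory, no fault
        faults += 1   # page fault: demand paging loads the page only now
        if len(frames) == num_frames:
            victim = frames.popleft()  # no free frame: page replacement
            resident.remove(victim)
            evictions += 1
        frames.append(page)
        resident.add(page)
    return faults, evictions


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
small_memory = run_demand_paging(refs, num_frames=3)   # (15, 12)
large_memory = run_demand_paging(refs, num_frames=10)  # (6, 0)
```

With ten frames every fault is a pure demand-paging load into a free frame, so nothing is evicted. With three frames the same faults also force replacements. Demand paging decides when a page comes in. Page replacement decides which page leaves when memory is full.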

CPU cycle speed

Finding the latencies of L1/L2/L3 caches is easy:
Approximate cost to access various caches and main memory?
but I am interested in what the cost is (in CPU cycles) for translating a virtual address to physical page address when:
There is a hit in the L1 TLB
There is a miss in the L1 TLB but a hit in the L2 TLB
There is a miss in the L2 TLB and a hit in the page table
(I don't think there can be a miss in the page table, can there? If there can, what is the cost of this?)
I did find this:
Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access
TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycle per access
Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
PDE cache = 32 items?
http://www.7-cpu.com/cpu/SandyBridge.html
but it doesn't mention the cost of a hit/accessing the relevant TLB cache?
Typically the L1 TLB access time will be less than the cache access time to allow tag comparison in a set associative, physically tagged cache. A direct mapped cache can delay the tag check by assuming a hit. (For an in-order processor, a miss with immediate use of the data would need to wait for the miss to be handled, so there is no performance penalty. For an out-of-order processor, correcting for such wrong speculation can have noticeable performance impact. While an out-of-order processor is unlikely to use a direct mapped cache, it may use way prediction which can behave similarly.) A virtually tagged cache can (in theory) delay TLB access even more since the TLB is only needed to verify permissions, not to determine a cache hit, and the handling of permission violations is generally somewhat expensive and rare.
This means that L1 TLB access time will generally not be made public since it will not influence software performance tuning.
L2 hit time would be equivalent to the L1 miss penalty. This will vary depending on the specific implementation and may not be a single value. E.g., If the TLB uses banking to support multiple accesses in a single cycle, bank conflicts can delay accesses, or if rehashing is used to support multiple page sizes, a page of the alternate size will take longer to find (both of these cases can accumulate delay under high utilization).
The time required for an L2 TLB fill can vary greatly. ARM and x86 use hardware TLB fill using a multi-level page table. Depending on where page table data can be cached and whether there is a cache hit, the latency of a TLB fill can be between the latency of a main memory access for each level of the page table and the latency of the cache where the page table data is found for each level (plus some overhead).
Complicating this further, more recent Intel x86 have paging-structure caches which allow levels of the page table to be skipped. E.g., if a page directory entry (an entry in a second level page table which points to a page of page table entries) is found in this cache, rather than starting from the base of the page table and doing four dependent look-ups only a single look-up is required.
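The effect can be illustrated with a small Python model of a four-level walk. It assumes the x86-64 layout with 4 KiB pages (four 9-bit indices plus a 12-bit offset), uses nested dictionaries in place of real page-table pages, and models the paging-structure cache as a plain dictionary named `pde_cache`. All of these are simplifications for the sketch.

```python
LEVEL_BITS = 9    # 512 entries per table
OFFSET_BITS = 12  # 4 KiB pages
LEVELS = 4


def split_vaddr(vaddr):
    """Split a virtual address into four table indices (top level first) and an offset."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indices = []
    for level in range(LEVELS):
        shift = OFFSET_BITS + LEVEL_BITS * (LEVELS - 1 - level)
        indices.append((vaddr >> shift) & ((1 << LEVEL_BITS) - 1))
    return indices, offset


def walk(root, vaddr, pde_cache):
    """Return (physical address, number of dependent table look-ups)."""
    indices, offset = split_vaddr(vaddr)
    key = tuple(indices[:3])  # identifies one page-directory entry
    lookups = 0
    if key in pde_cache:
        table = pde_cache[key]  # skip the three upper levels
    else:
        table = root
        for idx in indices[:3]:
            table = table[idx]  # one dependent memory access per level
            lookups += 1
        pde_cache[key] = table
    frame = table[indices[3]]   # final look-up of the page table entry
    lookups += 1
    return (frame << OFFSET_BITS) | offset, lookups


# Hypothetical mappings for two neighbouring pages under the same directory entry.
root = {1: {2: {3: {4: 0x55, 5: 0x66}}}}
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x123
cache = {}
first = walk(root, vaddr, cache)            # cold: (0x55123, 4)
second = walk(root, vaddr + 0x1000, cache)  # directory entry cached: (0x66123, 1)
```

The second walk touches a different page but shares the upper three indices with the first, so only the last-level look-up remains.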
(It might be worth noting that using a page the size of the virtual address region covered by a level of the page table (e.g., 2 MiB and 1 GiB for x86-64), reduces the depth of the page table hierarchy. Not only can using such large pages reduce TLB pressure, but it can also reduce the latency of a TLB miss.)
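The arithmetic behind those page sizes, as a short Python sketch. It assumes 4 KiB base pages, 512 entries per table and a four-level hierarchy. The 64-entry TLB in the usage lines is the L1 data TLB size quoted in the question, used here only as an example figure, and the helper names are made up for illustration.

```python
BASE_PAGE = 4 * 1024     # 4 KiB base page
ENTRIES_PER_TABLE = 512  # 9 index bits per level
TOTAL_LEVELS = 4


def page_size(levels_skipped):
    """Size of a page that replaces `levels_skipped` lower levels of the table."""
    return BASE_PAGE * ENTRIES_PER_TABLE ** levels_skipped


def walk_depth(levels_skipped):
    """Dependent look-ups needed for a full walk to such a page."""
    return TOTAL_LEVELS - levels_skipped


def tlb_reach(entries, size):
    """Bytes of address space covered when every TLB entry maps a page of `size`."""
    return entries * size


MIB = 1024 ** 2
GIB = 1024 ** 3
sizes = [page_size(n) for n in range(3)]    # 4 KiB, 2 MiB, 1 GiB
depths = [walk_depth(n) for n in range(3)]  # 4, 3, 2 look-ups
reach_small = tlb_reach(64, page_size(0))   # 256 KiB
reach_large = tlb_reach(64, page_size(1))   # 128 MiB
```

Each level removed multiplies the page size by 512 and shortens the walk by one look-up, so the same 64 entries cover 512 times as much address space with 2 MiB pages.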
A page table miss is handled by the operating system. This might result in the page still being in memory (e.g., if the write to swap has not been completed) in which case the latency will be relatively small. (The actual latency will depend on how the operating system implements this and on cache hit behavior, though cache misses both for the code and the data are likely since paging is an uncommon event.) If the page is no longer in memory, the latency of reading from secondary storage (e.g., a disk drive) is added to the software latency of handling an invalid page table entry (i.e., a page table miss).

What is TLB shootdown?

What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The actions of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
Page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware).
However, the page table can be quite large, and traversing it (to find the physical address corresponding to a virtual address) can be time-consuming. To make this faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently used virtual-to-physical translations.
Clearly, the TLB entries need to stay in sync with their respective page table entries at all times. TLBs are per-core caches, i.e., every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all of the cores. This process is called TLB shootdown.
TLB flushing can be triggered by various virtual memory operations that change the page table entries like page migration, freeing pages etc.
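A minimal Python simulation of the problem a shootdown solves. The `Core` class and the `remap` helper are hypothetical. The loop over cores stands in for the inter-processor interrupts a real OS would send, and a dictionary stands in for the shared page table.

```python
class Core:
    """One CPU core with a private TLB in front of a shared page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared by all cores
        self.tlb = {}                 # private per-core cache of translations

    def translate(self, vpn):
        if vpn not in self.tlb:
            self.tlb[vpn] = self.page_table[vpn]  # refill from the page table
        return self.tlb[vpn]

    def invalidate(self, vpn):
        self.tlb.pop(vpn, None)


def remap(page_table, cores, vpn, new_pfn, shootdown):
    """Change a mapping, optionally invalidating it in every core's TLB."""
    page_table[vpn] = new_pfn
    if shootdown:
        for core in cores:  # stands in for IPIs sent to the other cores
            core.invalidate(vpn)


def demo(shootdown):
    page_table = {5: 100}
    cores = [Core(page_table), Core(page_table)]
    for core in cores:
        core.translate(5)  # both cores now cache vpn 5 -> frame 100
    remap(page_table, cores, 5, 200, shootdown)
    return [core.translate(5) for core in cores]


stale = demo(shootdown=False)  # [100, 100]: cores keep using the old frame
fresh = demo(shootdown=True)   # [200, 200]: entries were invalidated and refilled
```

Without the invalidation step, both cores keep translating through the old mapping even though the page table has changed. That is the inconsistency the shootdown prevents.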

Resources