Since different processes have their own page tables, how does the TLB differentiate between entries from two different page tables?
Or is the TLB flushed every time a different process gets the CPU?
Yes, setting a new top-level page-table physical address (such as x86 mov cr3, rax) invalidates all existing TLB entries [1]; on other ISAs, software would possibly need to use additional instructions to ensure safety. (I'm guessing about that; I only know how x86 does it.)
Some ISAs do purely software management of TLBs, in which case it would definitely be up to software to flush all or at least the non-global TLB entries on context switch.
A more recent CPU feature allows us to avoid full invalidations in some cases: a context ID provides some extra tag bits for each TLB entry, so the CPU can keep track of which page table each entry came from and only hit on entries that match the current context. This way, frequent switches between a small set of page tables can keep some entries valid.
On x86, the relevant feature is PCID (Process-Context ID): when the OS sets a new top-level page-table address, it associates it with a context-ID number (a 12-bit value passed in the low bits of CR3). Page tables have to be page aligned, so those low bits of the page-table address are otherwise unused; this feature repurposes them as a separate bitfield, with the CR3 bits above the page offset used normally as the physical page number.
The OS can also tell the CPU whether or not to flush the TLB entries for a context when it loads a new page table, whether it's switching back to a previous context or recycling a context ID for a different task. (It does this by setting the high bit of the new CR3 value; see the mov to CR3 entry in the instruction-set manual.)
x86 PCID was new in 2nd-gen Nehalem (Westmere): https://www.realworldtech.com/westmere/ has a brief description of it from a CPU-architecture PoV.
Similar support I think extends to HW virtualization / nested page tables, to reduce the cost of hypervisor switches between guests.
I expect other ISAs that have any kind of page-table context mechanism work broadly similarly, with it being a small integer that the OS sets along with / as part of a new top-level page-table address.
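As a rough sketch (not any particular OS's code), here's how a kernel might pack a PCID into a new CR3 value and decide whether to request a flush, following the layout described above. The helper names are invented for the example, and it assumes PCIDs have been enabled via CR4.PCIDE.

    #include <stdint.h>

    #define CR3_PCID_MASK  0xFFFULL        /* low 12 bits: process-context ID (requires CR4.PCIDE) */
    #define CR3_NOFLUSH    (1ULL << 63)    /* set: keep this PCID's existing TLB entries */

    /* Build a CR3 value from a page-aligned top-level-table physical address and a PCID. */
    static inline uint64_t make_cr3(uint64_t pml4_phys, uint16_t pcid, int keep_tlb)
    {
        uint64_t cr3 = (pml4_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
        if (keep_tlb)
            cr3 |= CR3_NOFLUSH;            /* e.g. switching back to a context whose entries are still valid */
        return cr3;
    }

    /* Hypothetical context-switch path: install the new page table. */
    static inline void switch_address_space(uint64_t pml4_phys, uint16_t pcid, int keep_tlb)
    {
        uint64_t cr3 = make_cr3(pml4_phys, pcid, keep_tlb);
        __asm__ volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");
    }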
Footnote 1: Except for "global" ones where the PTE indicates that this page will be mapped the same in all page tables. This lets OSes optimize by marking kernel pages that way, so those TLB entries can stay hot when the kernel context-switches user-space tasks. Both page tables should actually have valid entries for that page that do map to the same phys address, of course. On x86 at least, there is a bit in the PTE format that lets the CPU know it can assume the TLB entry is still valid across different page directories.
I know a cache is not flushed at context switch. So if the new process has demanded a page that maps to the same physical address as the previous process (and the previous process is now swapped to disk), the contents of the previous process would still be cached. Wouldn't they be accessed by the new process when it tries to access its part of memory from the physically mapped cache, if the new process has its virtual memory mapped to the same physical address as the previous one?
For most contemporary operating systems, the answer to your question is paging.
Let's say you have an OS with 4 GB of addressable memory but only 2 GB of physical memory installed on the system. Further assume that one process with a 1.5 GB memory requirement is active, so the situation looks something like the following.
Because there is enough physical memory, the complete virtual address space of this process is mapped to physical memory.
Now let's say a new process with a 1.5 GB memory requirement enters the system. Since there is not enough physical memory for both processes, the address space of the first process may be mapped out to disk (paged out), given that the second process actively needs its entire 1.5 GB of space.
So now the situation looks as follows.
Please note that from the perspective of the first process everything is as before: as soon as it becomes active and uses its virtual address space, the OS will page its memory stored on disk back into physical memory.
I know a cache is not flushed at context switch. So if the new process
has demanded a page that maps to the same physical address as the
previous process (and the prev process is now swapped to disk), the
contents of the previous process would still be cached.
Your premise is not entirely correct. Where did you read that the cache will not be flushed?
Whether caches are invalidated on a context switch depends on several factors, some of which are not under the control of the OS.
Some OS implementations do flush the caches (read below), and the ones that don't require special support from the hardware. Either way, any OS worth its salt will make sure that stale data is not served to any process.
The following is relevant text from some very good OS books.
From Understanding the Linux Kernel
Table 2-11. Architecture-independent TLB-invalidating methods
Method name -- flush_tlb
Description -- Flushes all TLB entries of the non-global pages owned by the current process
Typically used when -- Performing a process switch
If the CPU switches to another process that is using the same set of
page tables as the kernel thread that is being replaced, the kernel
invokes __flush_tlb() to invalidate all non-global TLB entries of the
CPU.
From Modern Operating Systems
The presence of caching and the MMU can have a major impact on
performance. In a multiprogramming system, when switching from one
program to another, sometimes called a context switch, it may be
necessary to flush all modified blocks from the cache and change the
mapping registers in the MMU.
And from Operating Systems: Three Easy Pieces:
One approach is to simply flush the TLB on context switches, thus
emptying it before running the next process. On a software-based
system, this can be accomplished with an explicit (and privileged)
hardware instruction; with a hardware-managed TLB, the flush could be
enacted when the page-table base register is changed (note the OS must
change the PTBR on a context switch anyhow). In either case, the flush
operation simply sets all valid bits to 0, essentially clearing the
contents of the TLB. By flushing the TLB on each context switch, we
now have a working solution, as a process will never accidentally
encounter the wrong translations in the TLB.
On the other hand, to reduce this overhead, some systems add hardware
support to enable sharing of the TLB across context switches. In
particular, some hardware systems provide an address space identifier
(ASID) field in the TLB. You can think of the ASID as a process
identifier (PID), but usually it has fewer bits (e.g., 8 bits for the
ASID versus 32 bits for a PID). If we take our example TLB from above
and add ASIDs, it is clear processes can readily share the TLB: only
the ASID field is needed to differentiate otherwise identical
translations.
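To make the ASID idea concrete, here is a minimal sketch in C of an ASID-tagged TLB lookup along the lines of the OSTEP description above; the entry layout, field widths, and names are illustrative assumptions, not any real hardware's format.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    /* One TLB entry: a virtual page number tagged with the address-space ID
     * it was filled for, plus the translation and a valid bit. */
    struct tlb_entry {
        uint64_t vpn;      /* virtual page number */
        uint64_t pfn;      /* physical frame number */
        uint8_t  asid;     /* address-space identifier (e.g. 8 bits) */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Hit only if the entry is valid, the VPN matches, AND the ASID matches
     * the currently running address space; two processes with the same
     * virtual page number can then coexist in the TLB without flushing. */
    bool tlb_lookup(uint64_t vpn, uint8_t current_asid, uint64_t *pfn_out)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == current_asid) {
                *pfn_out = tlb[i].pfn;
                return true;
            }
        }
        return false;  /* miss: walk the page table (or trap to the OS) and refill */
    }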
Suppose two address spaces share a largish lump of non-contiguous memory.
The system might want to share physical page table(s) between them.
These tables wouldn't use Global bits (even if supported), and would be tied to ASIDs if those are supported.
There are immediate benefits since the data cache will be less polluted than by a copy, less pinned RAM, etc.
Does the page walk take explicit advantage of this in any known architecture?
If so, does that imply the mmu is explicitly caching & sharing interior page tree nodes based on physical tag?
Sorry for the multiple questions; it really is one broken down. I am trying to determine if it is worth devising a measurement test for this.
On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes, there's an obvious benefit there to having two different page directories point to the same subtree for a shared region of virtual address space. Or for some AMD CPUs, the walk fetches through L2, skipping L1d.
"What happens after a L2 TLB miss?" has more details about the fact that page walks definitely fetch through cache, e.g. Broadwell perf counters exist to measure hits.
("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)
Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.
However, I don't think you'll find that PDE caching happening across a change in the top-level page-tables.
On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, makes TLB entries from different ASIDs unavailable for hits).
The main issue is that the TLB and the page-walker internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with semantics like x86 for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)
So yes, they could detect same physical address, but for sanity you also have to avoid using stale cached data from after a store to that physical address.
Without hardware-managed coherence between page-table stores and TLB/pagewalk, there's no way this cache could happen safely.
That said, some x86 CPUs do go beyond what's on paper and do limited coherency with stores, but only to protect you from speculative page walks, for backwards compatibility with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/
So it's not unheard of for microarchitectures to snoop stores to detect stores to certain ranges; you could plausibly have stores snoop the address ranges near locations the page-walker had internally cached, effectively providing coherence for internal page-walker caches.
Modern x86 does in practice detect self-modifying code by snooping for stores near any in-flight instructions (see "Observing stale instruction fetching on x86 with self-modifying code"). In that case, snoop hits are handled by nuking the whole back-end state back to retirement state.
So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.
Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.
The benefit of this would be tiny
After the first page-walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), then the usual cache-within-page-walker mechanisms can act normally.
So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).
All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.
L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.
If you're context-switching very frequently, the ASIDs let TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. This is really not bad and very much not worth spending a lot of transistors and power budget on.
I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.
In the segmentation scheme, every time a memory access is made, the MMU would translate the virtual address to the actual address by looking up the segment table.
Is the segment table stored inside the TLB or in RAM?
Is the segment table stored inside the TLB or in RAM?
This depends on which type of CPU and which mode the CPU is in.
For 80x86, when a segment register is loaded the CPU stores "base address, limit and attributes" for the segment in a hidden part of the segment register.
For real mode, virtual-8086 mode and system management mode, when a segment register is loaded the CPU just does "hidden segment base = segment value * 16" and there are no tables in RAM.
For protected mode and long mode, when a segment register is loaded the CPU uses the value being loaded into the segment register as an index into a table in RAM, and (after doing protection checks) loads the "base address, limit and attributes" information from the corresponding table entry into the hidden part of the segment register.
Note that (for protected mode) almost nobody used segmentation because the segment register loads are slow (due to protection checks and table lookups); so CPU manufacturers optimised the CPU for "no segmentation" (e.g. if segment bases are zero, instead of doing "linear address = virtual address + segment base" a modern CPU will just do "linear address = virtual address" and avoid the cost of an unnecessary addition and start the cache/memory lookup sooner) and didn't bother optimising segment register loads much either; and then when AMD designed long mode they realised nobody wanted segmentation and disabled most of it for 64-bit code (ignoring segment bases for most segment registers to get rid of the extra addition, and ignoring segment limits to get rid of the cost of segment limit checks). However, operating systems that don't use segmentation were using gs and fs as a hack to get fast access to CPU-specific or thread-specific data (because, unlike some other CPUs, 80x86 doesn't have registers that can only be modified by supervisor code, which would be more convenient for this purpose); so AMD kept the "linear address = virtual address + segment base" behaviour for these two segment registers and added the ability to modify the hidden "base address" part of gs and fs (via MSRs and swapgs) to make it easier to port operating systems (Windows) to long mode.
In other words, for 80x86 there are 3 different ways to set a segment's information (by calculation, by table lookup, or by MSR).
Also note that for most instructions (excluding things like segment register loads) 80x86 CPUs don't care how a segment's information was set and only use the hidden parts of segment registers. This means that the CPU doesn't have to consult a table every time it fetches code from cs and every time it fetches data from memory. It also means that the majority of the CPU doesn't care which mode the CPU is in (e.g. instructions like mov eax,[ds:address] only depend on the values in the hidden part of segment registers and don't depend on the CPU mode); which is why there's no benefit to removing obsolete CPU modes (removing support for real mode wouldn't reduce the size or complexity of the CPU).
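As a rough illustration of the 80x86 behaviour described above, here's a minimal sketch in C modelling the hidden part of a segment register and the three ways it can be filled (calculation in real mode, descriptor-table lookup in protected mode, direct base write for fs/gs in long mode), plus the address calculation that ordinary instructions use. The structures and names are made up for the example, and all protection and limit checks are omitted.

    #include <stdint.h>

    /* Hidden ("cached") part of a segment register: filled when the segment
     * register is loaded, and the only thing ordinary memory accesses consult. */
    struct seg_cache {
        uint64_t base;
        uint32_t limit;
        uint32_t attributes;
    };

    /* Real mode / v8086 / SMM: no table in RAM, just a calculation. */
    void load_seg_real_mode(struct seg_cache *seg, uint16_t selector)
    {
        seg->base  = (uint64_t)selector * 16;
        seg->limit = 0xFFFF;
    }

    /* Protected mode: the selector indexes a descriptor table in RAM (GDT/LDT);
     * base/limit/attributes are copied into the hidden part once, at load time. */
    struct descriptor { uint64_t base; uint32_t limit; uint32_t attributes; };

    void load_seg_protected_mode(struct seg_cache *seg, uint16_t selector,
                                 const struct descriptor *table)
    {
        const struct descriptor *d = &table[selector >> 3];  /* index field of the selector */
        /* (protection checks omitted) */
        seg->base       = d->base;
        seg->limit      = d->limit;
        seg->attributes = d->attributes;
    }

    /* Long-mode fs/gs: the hidden base can also be written directly (MSR-style). */
    void load_seg_base_direct(struct seg_cache *seg, uint64_t base)
    {
        seg->base = base;
    }

    /* What an ordinary memory access does: no table lookup, just the hidden base. */
    uint64_t linear_address(const struct seg_cache *seg, uint64_t virtual_offset)
    {
        return seg->base + virtual_offset;   /* limit check omitted */
    }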
For other CPUs; most don't support segmentation (and only support paging or nothing), and I'm not familiar with how it works for any that do support it. However I doubt any CPU would do a table lookup every time anything is fetched (it'd be far too slow/expensive to be practical); and I'd expect that for all CPUs that support segmentation, information for "currently in use" segments is stored internally somehow.
The segment table is the reference whenever you are using memory, so the table has to be stored for later use; it is stored at physical addresses, i.e., in RAM.
I've got a question about virtual memory management, more specifically, the address translation.
When an application runs, the CPU receives instructions containing virtual memory addresses, and translates them into physical addresses via the page table.
My question is, since the page table also resides in a memory block, does that mean the CPU has to access memory twice for a single memory-access instruction? If the answer is no, then how does this actually work? Which part did I miss?
Could anyone give me some details about this?
As usual, the answer is neither yes nor no.
In the worst case you have to do a walk of the page table, which is indeed stored in (some kind of) memory. This is not necessarily only one lookup; it can be multiple lookups, see for example a two-level table (example from Wikipedia).
However, this page table is typically accompanied by a hardware assist called the translation lookaside buffer (TLB); this is essentially a cache for the page table, and the lookup process can be seen in this image. It works just as you would expect a cache to work: if a lookup succeeds you happily continue with the physical fetch, and if it fails you proceed to the aforementioned page walk and update the cache afterwards.
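For illustration, here's a minimal sketch in C of that flow for a hypothetical two-level page table: check a tiny TLB first, and only on a miss walk the two levels in memory and refill the TLB. The 32-bit 10/10/12 address split and all structure names are assumptions for the example, not a specific architecture's format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical 32-bit split: 10-bit directory index, 10-bit table index, 12-bit offset. */
    #define PAGE_SHIFT   12
    #define INDEX_BITS   10
    #define INDEX_MASK   ((1u << INDEX_BITS) - 1)
    #define PTE_PRESENT  0x1u

    typedef struct { uint32_t vpn, pfn; bool valid; } tlb_entry_t;
    static tlb_entry_t tlb[16];

    /* Second-level tables hold PTEs; the top-level directory points at them (NULL = not present). */
    static uint32_t *page_directory[1 << INDEX_BITS];

    /* Translate a virtual address, counting how many memory accesses the
     * translation itself needed (0 on a TLB hit, 2 for a full two-level walk). */
    bool translate(uint32_t vaddr, uint32_t *paddr, int *mem_accesses)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        *mem_accesses = 0;

        for (int i = 0; i < 16; i++)                      /* TLB hit: no extra memory access */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
                return true;
            }

        uint32_t dir_idx = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
        uint32_t tbl_idx = (vaddr >> PAGE_SHIFT) & INDEX_MASK;

        uint32_t *table = page_directory[dir_idx];        /* 1st memory access */
        (*mem_accesses)++;
        if (!table)
            return false;                                 /* page fault: the OS must handle it */

        uint32_t pte = table[tbl_idx];                    /* 2nd memory access */
        (*mem_accesses)++;
        if (!(pte & PTE_PRESENT))
            return false;

        uint32_t pfn = pte >> PAGE_SHIFT;
        tlb[vpn % 16] = (tlb_entry_t){ .vpn = vpn, .pfn = pfn, .valid = true };  /* refill the TLB */
        *paddr = (pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
        return true;
    }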
This hardware assist is usually implemented as a CAM (Content Addressable Memory), something that's mostly used in network processing but is also very useful here. It is a memory component that does not do the lookup based on an address but based on 'content', or any generic key (the keys don't have to be contiguous, incrementing numbers). In this case the key would be your virtual address, and the result of the lookup would be your physical address. As this CAM is a separate component and is very fast, you could state that as long as you hit it, you don't incur any extra memory overhead for virtual -> physical address translation.
You could ask why they don't put the whole page table in a CAM. Quite simply, CAMs are both quite expensive and, more importantly, quite energy-hungry, so you don't want to make them too big (we wouldn't want a laptop that requires 1 kW to run, would we?).
Sometimes.
The MMU contains a cache of virtual to physical address mapping, called a TLB (Translation Lookaside Buffer).
If the page in question is not in the TLB (a TLB miss), then the relevant piece of the page table has to be loaded from main memory into that cache first, which requires an additional memory access.
Finally, if the page cannot be found at all, a trap is raised (a page fault), and the OS has an opportunity to fix this - e.g. allocate memory, load the piece from a file or swap space, and similar.
The details of how this is done vary between architectures; on some, a TLB miss also requires software to configure the TLB, though on most this is automatic. (But the OS would have to flush the TLB when doing a context switch and load a new page table for, e.g., a new process.)
More info e.g. here https://www.kernel.org/doc/gorman/html/understand/understand006.html
I'm a student who has been doing some research on Hyper-Threading recently. I'm a little confused about one feature: L1 Data Cache Context Mode.
In the architecture optimization manual, it was described that L1 cache can operate in two modes:
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory is mapped identically across logical processors sharing the L1 data cache.
However, I am curious about how cache get partitioned in the adaptive mode according to the description.
On Intel, a value of 1 for the L1 Context ID feature flag indicates that the L1 data cache mode can be set to either adaptive mode or shared mode, while a value of 0 indicates this feature is not supported. Check the definition of IA32_MISC_ENABLE MSR bit 24 (L1 Data Cache Context Mode) for details.
According to Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A (Chapter 11/Cache Control), which I quote below:
Shared mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes. In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
Adaptive Mode
Adaptive mode facilitates L1 data cache sharing between logical processors. When running in adaptive mode, the L1 data cache is shared across logical processors in the same core if:
• CR3 control registers for logical processors sharing the cache are identical.
• The same paging mode is used by logical processors sharing the cache.
In this situation, the entire L1 data cache is available to each logical processor (instead of being competitively shared).
If CR3 values are different for the logical processors sharing an L1 data cache or the logical processors use different paging modes, processors compete for cache resources. This reduces the effective size of the cache for each logical processor.
Aliasing of the cache is not allowed (which prevents data thrashing).
I would guess there is no definite documented approach for partitioning the L1 data cache.
The document just states that if you use adaptive mode and CR3 or the paging mode differ between the logical processors, the cache is not shared and they "compete" for the cache. It doesn't say how the partitioning works.
The most straightforward manner to implement this would be to statically reserve half of the ways of the data cache to each of the processors. This would essentially assign half the data cache statically to each processor.
Alternatively they could add an additional bit to the virtual tag of each cache line to distinguish which processor the line belongs to. This would allow a dynamic partition of the cache. This fits the description of "competing" for the cache better than a static partition.
If you really need to know, you could design some micro-benchmarks to verify that one of these schemes is actually used.
The L1 data cache is not partitioned in either mode and is always competitively shared.
Note that there is an obvious error in the manual: the mode isn't determined by a context-ID bit, but by IA32_MISC_ENABLE[24]. This enhancement is supported on later steppings of Northwood with HT and all Prescott processors with HT. The default value is zero, which represents adaptive mode. However, in certain processors, an updated BIOS may switch to shared mode by setting IA32_MISC_ENABLE[24] due to a bug in these processors that occurs only in adaptive mode.
In earlier steppings of Northwood with HT, only shared mode is supported. In shared mode, when a load request is issued to the L1 data cache, the request is first processed on the "fast path," which involves making a way prediction based on bits 11-15 of the linear address and producing a speculative hit/miss signal as a result. In processors with HT, the logical core ID is also compared. Both the partial tag and the logical core ID have to match in order to get a speculative hit. In general, this helps improve the correct speculative hit rate.
If the two sibling logical cores operate in the same paging mode and have identical CR3 values, which indicate that accesses from both cores use the same page tables (if paging is enabled), it would be better to produce a speculative hit even if the logical core ID doesn't match on the fast path of the cache.
In adaptive mode, a context ID value is calculated whenever the paging mode or the CR3 register of one of the cores is changed. If the paging modes and the CR3 values match, the context ID bit is set to one of the two possible values; otherwise, it's set to the other value. When a load request is issued to the cache, the context ID is checked. If it indicates that the cores have the same address-translation structures, the logical core ID comparison result is ignored and a speculative hit is produced if the partial virtual tag matches. Otherwise, the logical core ID comparison takes effect as in shared mode.
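As a rough model of that fast-path decision (my interpretation, not Intel's actual logic), here's a small C sketch: a speculative hit always requires the partial virtual tag to match, and the logical-core-ID comparison is skipped only when the context-ID bit says both logical cores are using the same paging mode and CR3. All field names and widths are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-line lookup state for the L1d fast path (illustrative fields only). */
    struct line_tag {
        uint32_t partial_vtag;   /* derived from linear-address bits 11-15 */
        uint8_t  core_id;        /* logical core that filled the line */
    };

    /* Decide whether the fast path reports a speculative hit.
     * same_context reflects the context-ID bit: both logical cores have
     * identical CR3 values and paging modes (meaningful in adaptive mode only). */
    bool speculative_hit(const struct line_tag *line,
                         uint32_t lookup_partial_vtag,
                         uint8_t  lookup_core_id,
                         bool     adaptive_mode,
                         bool     same_context)
    {
        if (line->partial_vtag != lookup_partial_vtag)
            return false;                              /* partial tag must always match */

        if (adaptive_mode && same_context)
            return true;                               /* ignore which core filled the line */

        return line->core_id == lookup_core_id;        /* shared-mode behaviour */
    }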