If a PTE is cached in the TLB, accesses that hit in the TLB do not go through the page table, so the page may look as if it has not been recently accessed there. Does that mean that when an NRU replacement policy is used, this page is very likely to be replaced? Or is there some mechanism that synchronizes the reference bit between the TLB and the page table?
Since different processes have their own page tables, how does the TLB differentiate between two page tables?
Or is the TLB flushed every time a different process gets CPU?
Yes, setting a new top-level page table physical address (such as x86 mov cr3, rax) invalidates all existing TLB entries (see footnote 1). On other ISAs, software might need to use additional instructions to ensure safety. (I'm guessing about that; I only know how x86 does it.)
Some ISAs do purely software management of TLBs, in which case it would definitely be up to software to flush all or at least the non-global TLB entries on context switch.
A more recent CPU feature allows us to avoid full invalidations in some cases. A context ID gives some extra tag bits with each TLB entry, so the CPU can keep track of which page-table they came from and only hit on entries that match the current context. This way, frequent switches between a small set of page tables can keep some entries valid.
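To make the tagging idea concrete, here is a minimal sketch of a TLB whose entries carry a context ID. The class and method names (TaggedTLB, switch_context, insert, lookup) are hypothetical, and a real TLB is a fixed-size hardware structure rather than a dictionary.

```python
class TaggedTLB:
    """Toy model of a TLB whose entries are tagged with a context ID."""

    def __init__(self):
        # Key: (context_id, virtual_page_number) -> physical frame number
        self.entries = {}
        self.current_context = 0

    def switch_context(self, context_id):
        # No flush here: entries from other contexts stay cached,
        # but they cannot hit because their tag will not match.
        self.current_context = context_id

    def insert(self, vpn, pfn):
        self.entries[(self.current_context, vpn)] = pfn

    def lookup(self, vpn):
        # Returns the frame number on a hit, or None on a miss.
        return self.entries.get((self.current_context, vpn))


tlb = TaggedTLB()
tlb.switch_context(1)
tlb.insert(0x10, 0xAAA)
tlb.switch_context(2)
miss_in_other_context = tlb.lookup(0x10)   # tag does not match context 2
tlb.switch_context(1)
hit_after_switch_back = tlb.lookup(0x10)   # entry survived the switches
```

Switching away and back leaves the entry for context 1 usable, which is the benefit described above.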
On x86, the relevant feature is PCID (Process Context ID): when the OS sets a new top-level page-table address, it's associated with a context ID number (12 bits wide on x86-64). It's passed in the low bits of the page-table address. Page tables have to be page aligned, so those bits are otherwise unused; this feature repurposes them as a separate bitfield, with the CR3 bits above the page offset used normally as the physical page number.
The OS can also tell the CPU whether or not to flush the TLB entries for that context when it loads a new page table, either to switch back to a previous context (keep them) or to recycle a context ID for a different task (flush them). This is done with the high bit (bit 63) of the new CR3 value: when it is set, the entries are kept. See the manual entry for mov cr, reg.
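As a sketch of how such a CR3 value could be put together, assuming the x86-64 layout with PCID enabled (PCID in bits 11:0, the no-invalidate flag in bit 63 of the value being written): the function name make_cr3 and its parameters are made up for illustration, and real kernels do this in privileged C or assembly, not Python.

```python
PCID_BITS = 12
PCID_MASK = (1 << PCID_BITS) - 1   # low 12 bits of CR3 hold the PCID
NO_FLUSH = 1 << 63                 # set: keep TLB entries tagged with this PCID


def make_cr3(top_level_table_phys, pcid, keep_tlb_entries):
    """Compose the 64-bit value an OS would write to CR3 (illustrative only)."""
    if top_level_table_phys & PCID_MASK:
        raise ValueError("top-level page table must be 4 KiB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = top_level_table_phys | pcid
    if keep_tlb_entries:
        value |= NO_FLUSH
    return value


# Switching back to a context whose cached translations are still valid:
resume_value = make_cr3(0x1000, 5, keep_tlb_entries=True)
# Reusing PCID 5 for a different task, so old entries must be flushed:
recycle_value = make_cr3(0x2000, 5, keep_tlb_entries=False)
```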
x86 PCID was new in 2nd-gen Nehalem (Westmere): https://www.realworldtech.com/westmere/ has a brief description of it from a CPU-architecture point of view.
Similar support I think extends to HW virtualization / nested page tables, to reduce the cost of hypervisor switches between guests.
I expect other ISAs that have any kind of page-table context mechanism work broadly similarly, with it being a small integer that the OS sets along with / as part of a new top-level page-table address.
Footnote 1: Except for "global" ones where the PTE indicates that this page will be mapped the same in all page tables. This lets OSes optimize by marking kernel pages that way, so those TLB entries can stay hot when the kernel context-switches user-space tasks. Both page tables should actually have valid entries for that page that do map to the same phys address, of course. On x86 at least, there is a bit in the PTE format that lets the CPU know it can assume the TLB entry is still valid across different page directories.
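A toy model of that behaviour, assuming a plain CR3 reload without PCID: every entry is dropped except those whose PTE had the global bit set. The dictionary layout and the name flush_non_global are invented for this sketch.

```python
def flush_non_global(tlb_entries):
    """Model a full flush on page-table switch: only global entries survive."""
    return [entry for entry in tlb_entries if entry["global"]]


tlb_entries = [
    {"vpn": 0xFFFF8, "pfn": 0x100, "global": True},   # kernel page, marked global
    {"vpn": 0x00400, "pfn": 0x2A0, "global": False},  # user page of the old task
]
after_switch = flush_non_global(tlb_entries)
```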
As I understand it, load/store operations access data at a virtual memory address (vaddr), and this vaddr is translated into a physical address (paddr) so that the memory hierarchy can service the access.
The translation process first looks in the TLB; if no match is found, a multi-level(?) page table lookup is triggered.
My question is: can page-table entries also be held in the L1D cache, L2 cache, or LLC, in addition to the quite limited number of TLB entries?
Can someone clearly explain the difference between a cache miss, a TLB miss, and a page fault, and how these affect the effective memory access time?
Let me explain all these things step by step.
The CPU generates the logical address, which contains the page number and the page offset.
The page number is used to index into the page table to get the corresponding page frame number. Once we have the page frame in physical memory (also called main memory), we apply the page offset to get the right word of memory.
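The split into page number and offset can be sketched as follows, assuming 4 KiB pages and a single-level page table represented as a dictionary (both are simplifying assumptions, and translate is a hypothetical helper).

```python
OFFSET_BITS = 12                 # 4 KiB pages (assumed)
PAGE_SIZE = 1 << OFFSET_BITS


def translate(vaddr, page_table):
    """Map a virtual address to a physical address via a flat page table."""
    vpn = vaddr >> OFFSET_BITS            # page number: upper bits
    offset = vaddr & (PAGE_SIZE - 1)      # page offset: lower 12 bits
    frame = page_table[vpn]               # page number -> frame number
    return (frame << OFFSET_BITS) | offset


page_table = {0x12345: 0x00042}
paddr = translate(0x12345678, page_table)   # frame 0x42, offset 0x678
```

The offset is copied through unchanged; only the page number is replaced by the frame number.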
Why TLB (Translation Lookaside Buffer)
The page table is stored in physical memory and can sometimes be very large. To speed up the translation from logical address to physical address, we use a TLB, which is made of expensive but faster associative memory. Instead of going to the page table first, we look up the page number in the TLB. If the corresponding page frame number is found there, we avoid the page table completely (because we have both the page frame number and the page offset) and form the physical address.
TLB Miss
If we don't find the page frame number inside the TLB, it is called a TLB miss. Only then do we go to the page table to look for the corresponding page frame number.
TLB Hit
If we find the page frame number in the TLB, it's called a TLB hit, and we don't need to go to the page table.
Page Fault
Occurs when the page accessed by a running program is not present in physical memory. It means the page is present in the secondary memory but not yet loaded into a frame of physical memory.
Cache Hit
Cache memory is a small memory that operates at a faster speed than physical memory, and we always go to the cache before we go to physical memory. If we are able to locate the corresponding word in the cache, it's called a cache hit and we don't even need to go to physical memory.
Cache Miss
Only when the lookup fails to find the corresponding block in the cache (a block is similar to a physical memory page frame) do we have a cache miss. We then go to physical memory and do the whole process of going through the TLB or page table.
So the flow is basically this:
1. First go to the cache memory, and if it's a cache hit, then we are done.
2. If it's a cache miss, go to step 3.
3. First go to the TLB, and if it's a TLB hit, go to physical memory using the physical address formed, and we are done.
4. If it's a TLB miss, then go to the page table to get the frame number of your page for forming the physical address.
5. If the page is not found, it's a page fault. Use one of the page replacement algorithms if all the frames are occupied by some page; otherwise just load the required page from secondary memory into a free physical memory frame.
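The five steps above can be sketched as a small simulation. This is only an illustration under several assumptions: the cache is keyed by exact virtual address rather than by cache line, free frames are always available (so no replacement algorithm runs), and the name access and its parameters are hypothetical.

```python
def access(vaddr, cache, tlb, page_table, free_frames, page_size=4096):
    """Return the list of events that one memory access goes through."""
    events = []
    if vaddr in cache:                        # steps 1-2: check the cache first
        events.append("cache hit")
        return events
    events.append("cache miss")
    vpn = vaddr // page_size
    if vpn in tlb:                            # step 3: TLB lookup
        events.append("TLB hit")
    else:
        events.append("TLB miss")             # step 4: consult the page table
        if vpn not in page_table:             # step 5: page not in memory
            events.append("page fault")
            page_table[vpn] = free_frames.pop()   # load page into a free frame
        tlb[vpn] = page_table[vpn]            # cache the translation
    cache.add(vaddr)                          # fill the cache for next time
    return events


cache, tlb, page_table, free_frames = set(), {}, {}, [7, 8, 9]
first = access(0x2010, cache, tlb, page_table, free_frames)
second = access(0x2010, cache, tlb, page_table, free_frames)
third = access(0x2020, cache, tlb, page_table, free_frames)
```

The first access to a page pays for every miss in turn, a repeat of the same address stops at the cache, and a different address on the same page misses the cache but hits the TLB.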
End Note
The flow I have discussed applies to a virtual cache (VIVT), which is faster but not sharable between processes. The flow would definitely change for a physical cache (PIPT), which is slower but can be shared between processes. A cache can be addressed in multiple ways. If you are willing to dive deeper, have a look at this and this.
This diagram might help to see what will happen when there is a hit or a miss.
Just imagine a process is running and requires a data item X.
First, cache memory is checked to see if it has the requested data item. If it is there (cache hit), it is returned. If it is not there (cache miss), it is loaded from main memory.
On a cache miss, main memory is checked to see if it holds the page containing the requested data item (page hit). If that page is not there (page fault), the page containing the desired item has to be brought into main memory from disk.
While processing the page fault, the TLB is checked to see if the desired page's frame number is available there (TLB hit); otherwise (TLB miss) the OS has to consult the page table to service the page fault.
Time required to access these types of memory:
cache << main memory << disk
Cache access requires the least time, so a hit or miss at a given level drastically changes the effective access time.
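A rough way to see this numerically is a simple hierarchical model: every access pays the cache time, a cache miss additionally pays the main-memory time, and a page fault additionally pays the disk time. The formula, the function name, and the latency figures below are illustrative assumptions, not measurements.

```python
def effective_access_time(t_cache, t_mem, t_disk, cache_miss_rate, page_fault_rate):
    """Average time per access under a simple hierarchical miss model.

    page_fault_rate is the fraction of cache misses that also page-fault.
    """
    return t_cache + cache_miss_rate * (t_mem + page_fault_rate * t_disk)


# Hypothetical latencies in nanoseconds: cache 1, memory 100, disk 10,000,000.
no_faults = effective_access_time(1, 100, 10_000_000, 0.05, 0.0)
rare_faults = effective_access_time(1, 100, 10_000_000, 0.05, 0.0001)
```

Even when only one cache miss in ten thousand leads to a page fault, the disk term dominates the average in this model, which is why the ordering cache << main memory << disk matters so much.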
What causes page faults? Is it always because the memory has been
moved to hard disk? Or just moved around for other applications?
Well, it depends. If your system does not support multiprogramming (in a multiprogramming system, one or more programs that are ready to execute are loaded in main memory), then the page fault definitely occurred because the memory has been moved to hard disk.
If your system does support multiprogramming, then it depends on whether your operating system uses global or local page replacement. If it uses global replacement, then yes, there is a chance that the memory has been moved around for other applications. With local replacement, the memory has been moved back to hard disk. When a process incurs a page fault, a local page replacement algorithm selects for replacement some page that belongs to that same process. A global replacement algorithm, on the other hand, is free to select any page from the entire pool of frames. This distinction comes up more when dealing with thrashing.
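The difference between the two policies comes down to which frames are candidates for eviction. The sketch below uses least-recently-used as the selection rule purely as an example (the text does not name a specific algorithm), assumes the faulting process already owns at least one frame, and uses a hypothetical pick_victim helper.

```python
def pick_victim(frames, faulting_pid, policy):
    """Return the index of the frame to evict.

    frames: list of dicts with 'pid' (owner) and 'last_used' (timestamp).
    policy: 'local' considers only the faulting process's frames,
            'global' considers every frame in the pool.
    """
    if policy == "local":
        candidates = [i for i, f in enumerate(frames) if f["pid"] == faulting_pid]
    else:
        candidates = list(range(len(frames)))
    # Example rule: evict the least recently used candidate.
    return min(candidates, key=lambda i: frames[i]["last_used"])


frames = [
    {"pid": 1, "last_used": 5},
    {"pid": 2, "last_used": 1},
    {"pid": 1, "last_used": 3},
]
local_victim = pick_victim(frames, faulting_pid=1, policy="local")
global_victim = pick_victim(frames, faulting_pid=1, policy="global")
```

With the local policy, process 1 evicts one of its own pages; with the global policy, it takes a frame away from process 2.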
I am confused about the difference between a TLB miss and a page fault.
A TLB miss occurs when the page table entry required for converting a virtual address to a physical address is not present in the TLB (translation lookaside buffer). The TLB is like a cache, but rather than storing data it stores page table entries, so that we can completely bypass the page table in case of a TLB hit, as you can see in the diagram.
Is page fault a crash? Or is it the same as a TLB miss?
Neither of them is a crash, since a crash is not recoverable. We can recover from both a page fault and a TLB miss without any need to abort the process.
The operating system uses virtual memory, and page tables map virtual addresses to physical addresses. The TLB works as a cache for these mappings.
program >>> TLB >>> cache >>> Ram
A program looks for a page in the TLB; if it doesn't find that page, it's a TLB miss, and it then looks for the page in the cache.
If the page is not in the cache, it's a cache miss, and it then looks for the page in RAM.
If the page is not in RAM, it's a page fault, and the program looks for the data in secondary storage.
So, a typical flow would be:
Page Requested >> TLB miss >> cache miss >> page fault >> looks in secondary memory.
What is the difference between architected TLB and architected page table?
A TLB is a hardware structure not unlike a cache or a register file. It resides inside the processor. A page table is a structure in main memory. Wikipedia calls architected TLBs "software-managed TLBs" and an architected page table a "hardware-managed TLB".
Which of the two is architected only matters for the implementation of virtual memory. With an architected TLB, the operating system has to manipulate the TLB directly. Because the capacity of the TLB is limited, the operating system will likely keep an internal structure resembling a page table for each process. One downside of an architected TLB is the high cost of bringing in a new entry in software. Another is that the number of TLB entries is fixed across different processor generations. An example of this approach is MIPS.
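A minimal sketch of the software-managed case: a TLB miss calls an OS refill handler, which looks the translation up in the OS's own per-process structure and writes it into the fixed-size TLB. The class name, the FIFO eviction, and the dictionary-based "page table" are all assumptions made for illustration; they do not describe the actual MIPS mechanism.

```python
class SoftwareManagedTLB:
    """Toy model: the OS, not hardware, refills the TLB on a miss."""

    def __init__(self, capacity):
        self.capacity = capacity      # fixed number of entries
        self.entries = {}             # vpn -> pfn, kept in insertion order
        self.refills = 0              # how many times the OS handler ran

    def translate(self, vpn, os_page_table):
        if vpn in self.entries:       # TLB hit: no software involved
            return self.entries[vpn]
        # TLB miss: a real CPU would trap into the OS at this point.
        return self.refill_handler(vpn, os_page_table)

    def refill_handler(self, vpn, os_page_table):
        self.refills += 1
        pfn = os_page_table[vpn]      # OS-private structure; KeyError ~ page fault
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]  # FIFO eviction, chosen only for simplicity
        self.entries[vpn] = pfn
        return pfn


tlb = SoftwareManagedTLB(capacity=2)
os_page_table = {1: 10, 2: 20, 3: 30}
results = [tlb.translate(v, os_page_table) for v in (1, 2, 2, 3)]
```

Every miss costs a trip through the handler, which is the refill cost mentioned above.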
A processor with an architected page table will likely have a TLB too, but it is transparent to software, which only sees the page table. This makes TLB refills cheaper and allows a different TLB (e.g. bigger, multi-level) to be used in each processor generation. The downside is additional complexity, as the processor has to detect updates of the page table transparently and needs logic to perform the page table walks. An example of this approach is x86.
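To give an idea of what the page-walk logic works with, here is how a virtual address is split under x86-64 4-level paging with 4 KiB pages: four 9-bit table indices followed by a 12-bit page offset. The function name and the level labels (PML4, PDPT, PD, PT) follow common usage; this only decodes the indices and does not model the walk itself.

```python
def split_x86_64_vaddr(vaddr):
    """Split a 48-bit virtual address into four table indices and an offset."""
    offset = vaddr & 0xFFF            # bits 11:0, byte within the 4 KiB page
    pt = (vaddr >> 12) & 0x1FF        # bits 20:12, index into the page table
    pd = (vaddr >> 21) & 0x1FF        # bits 29:21, index into the page directory
    pdpt = (vaddr >> 30) & 0x1FF      # bits 38:30, index into the PDPT
    pml4 = (vaddr >> 39) & 0x1FF      # bits 47:39, index into the top-level table
    return pml4, pdpt, pd, pt, offset


example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
indices = split_x86_64_vaddr(example)
```

On a TLB miss the hardware uses each index in turn to read one entry per level, so a full walk needs up to four memory reads before the data access itself.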
What is a TLB shootdown in SMPs?
I am unable to find much information regarding this concept. Any good example would be very much appreciated.
A TLB (Translation Lookaside Buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches.
That process is called a "TLB shootdown".
A quick example:
You have some memory shared by all of the processors in your system.
One of your processors restricts access to a page of that shared memory.
Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
The actions of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.
I think the question demands a more detailed answer.
Page table: a data structure that stores the mapping between virtual memory (software) and physical memory (hardware).
However, the page table can be quite large, and traversing it (to find the physical address corresponding to a virtual address) can be a time-consuming process. To make this faster, a cache called the TLB (Translation Lookaside Buffer) is used, which stores recently used virtual-to-physical translations.
Clearly, the TLB entries need to be in sync with their respective page table entries at all times. TLBs are a per-core cache, i.e. every core has its own TLB.
Whenever a page table entry is modified by any of the cores, that particular TLB entry is invalidated in all of the cores. This process is called TLB shootdown.
TLB flushing can be triggered by various virtual memory operations that change the page table entries like page migration, freeing pages etc.
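The idea can be sketched with one dictionary per core standing in for its TLB. The names Core and change_mapping are hypothetical, and the loop over cores stands in for whatever cross-processor signalling a real kernel uses (typically inter-processor interrupts).

```python
class Core:
    """Toy core with its own private TLB (vpn -> pfn)."""

    def __init__(self):
        self.tlb = {}

    def invalidate(self, vpn):
        # Drop the cached translation if present; the next access re-walks
        # the page table and picks up the new mapping.
        self.tlb.pop(vpn, None)


def change_mapping(page_table, cores, vpn, new_pfn):
    """Update a page table entry, then shoot down stale TLB entries."""
    page_table[vpn] = new_pfn
    for core in cores:                # stands in for notifying every core
        core.invalidate(vpn)


page_table = {0x40: 0x111, 0x41: 0x222}
cores = [Core(), Core()]
for core in cores:
    core.tlb.update(page_table)       # both cores have cached both pages

change_mapping(page_table, cores, vpn=0x40, new_pfn=0x999)
```

After the call, no core still holds the old translation for page 0x40, while unrelated entries such as 0x41 stay cached.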