ARM Linux the validity of PTE - linux-kernel

I know that the MMU on ARM processors does not provide some bits, such as the referenced and dirty bits, in the PTE. As a result, these systems keep two sets of page tables: the processor-native page tables, which have neither referenced nor dirty bits, and software-maintained page tables with the required bits present.
However, my question is: since there is no validity bit in the processor-native page table, how does the MMU determine that a translation is invalid and generate a page fault?

Related

How does TLB differentiate between entries of different Page tables?

Since different processes have their own page tables, how does the TLB differentiate between entries from two different page tables?
Or is the TLB flushed every time a different process gets the CPU?
Yes, setting a new top-level page-table physical address (such as x86 mov cr3, rax) invalidates all existing TLB entries (see footnote 1); on other ISAs, software might need to use additional instructions to ensure safety. (I'm guessing about that; I only know how x86 does it.)
Some ISAs do purely software management of TLBs, in which case it would definitely be up to software to flush all or at least the non-global TLB entries on context switch.
A more recent CPU feature allows us to avoid full invalidations in some cases. A context ID gives some extra tag bits with each TLB entry, so the CPU can keep track of which page-table they came from and only hit on entries that match the current context. This way, frequent switches between a small set of page tables can keep some entries valid.
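As a rough illustration of the tagging idea, here is a minimal sketch of a toy TLB whose entries are keyed by a context ID as well as the virtual page number. The class name TaggedTLB and its methods are hypothetical, and a real TLB is a fixed-size hardware structure, not a dictionary.

```python
class TaggedTLB:
    """Toy model of a TLB whose entries carry a context-ID tag (hypothetical)."""

    def __init__(self):
        # (context_id, virtual_page_number) -> physical_frame_number
        self.entries = {}

    def insert(self, context_id, vpn, pfn):
        self.entries[(context_id, vpn)] = pfn

    def lookup(self, current_context, vpn):
        # A hit requires both the page number and the context tag to match.
        return self.entries.get((current_context, vpn))


tlb = TaggedTLB()
tlb.insert(context_id=1, vpn=0x10, pfn=0xAAA)
tlb.insert(context_id=2, vpn=0x10, pfn=0xBBB)

hit_ctx1 = tlb.lookup(1, 0x10)   # entry cached for context 1
hit_ctx2 = tlb.lookup(2, 0x10)   # same virtual page, different context
miss_ctx3 = tlb.lookup(3, 0x10)  # nothing is tagged with context 3
```

Switching between contexts 1 and 2 changes only which tag is compared, so neither context's entries need to be thrown away.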
On x86, the relevant feature is PCID (Process Context ID): when the OS sets a new top-level page-table address, it's associated with a context ID number (12 bits, CR3 bits 11:0). It's passed in the low bits of the page-table address. Page tables have to be page-aligned, so those bits are otherwise unused; this feature repurposes them as a separate bitfield, with the CR3 bits above the page offset used normally as the physical page number.
The OS can also tell the CPU whether or not to flush the TLB entries for that context ID when it loads a new page table, which covers both switching back to a previous context and recycling a context ID for a different task. This is controlled by the high bit (bit 63) of the new CR3 value: if it is set, the cached entries for that PCID are kept; if it is clear, they are invalidated. See the manual entry for mov cr, reg.
x86 PCID was new in 2nd-generation Nehalem (Westmere): https://www.realworldtech.com/westmere/ has a brief description of it from a CPU-architecture PoV.
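The CR3 layout described here can be sketched as plain bit arithmetic. This is only an illustration, assuming the layout in Intel's manual with PCIDs enabled: a 12-bit PCID in bits 11:0, and bit 63 set meaning "do not invalidate the entries for this PCID". The function names make_cr3 and split_cr3 are made up for the example, and the physical-address width limit is ignored.

```python
PCID_BITS = 12                      # PCID occupies CR3 bits 11:0
PCID_MASK = (1 << PCID_BITS) - 1
NOFLUSH_BIT = 1 << 63               # set: keep cached entries for this PCID


def make_cr3(table_phys_addr, pcid, keep_tlb_entries):
    """Compose a CR3 value from a page-aligned table address and a PCID."""
    if table_phys_addr & PCID_MASK:
        raise ValueError("top-level page table must be 4 KiB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = table_phys_addr | pcid
    if keep_tlb_entries:
        value |= NOFLUSH_BIT
    return value


def split_cr3(value):
    """Recover (table address, PCID, no-flush flag) from a composed value."""
    pcid = value & PCID_MASK
    table = value & ~NOFLUSH_BIT & ~PCID_MASK
    return table, pcid, bool(value & NOFLUSH_BIT)


cr3 = make_cr3(0x1234000, pcid=5, keep_tlb_entries=True)
fields = split_cr3(cr3)
```

Because the table address is page-aligned, its low 12 bits are always zero, which is why the PCID can share the register without losing any address bits.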
Similar support I think extends to HW virtualization / nested page tables, to reduce the cost of hypervisor switches between guests.
I expect other ISAs that have any kind of page-table context mechanism work broadly similarly, with it being a small integer that the OS sets along with / as part of a new top-level page-table address.
Footnote 1: Except for "global" ones where the PTE indicates that this page will be mapped the same in all page tables. This lets OSes optimize by marking kernel pages that way, so those TLB entries can stay hot when the kernel context-switches user-space tasks. Both page tables should actually have valid entries for that page that do map to the same phys address, of course. On x86 at least, there is a bit in the PTE format that lets the CPU know it can assume the TLB entry is still valid across different page directories.
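A small sketch of the effect of the global bit, assuming a toy TLB represented as a list of dictionaries (the field names and the function switch_page_tables are hypothetical): on a page-table switch without context IDs, only entries marked global survive.

```python
def switch_page_tables(tlb_entries):
    """Return the entries that survive a page-table switch without context IDs."""
    return [entry for entry in tlb_entries if entry["global"]]


tlb_entries = [
    {"vpn": 0x400, "pfn": 0x111, "global": False},    # user page
    {"vpn": 0xFFFF8, "pfn": 0x222, "global": True},   # kernel page marked global
]
survivors = switch_page_tables(tlb_entries)
```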

Reference bit synchronization for TLB and page table

If a PTE is cached in the TLB, its copy in the page table may look as if it has not been accessed recently. Does that mean that, when an NRU replacement policy is used, this page is very likely to be replaced? Or is there some mechanism that synchronizes the reference bit between the TLB and the page table?

How is LFU/LRU implemented?

How are page replacement policies like LRU/LFU implemented? Does the hardware MMU track the reference count (in the case of LFU)?
Is it possible for it to be implemented as part of the kernel?
Generally, the hardware provides minimal support for tracking which pages are accessed, and the OS kernel then uses that to implement some kind of pseudo-LRU paging policy.
For example, on x86, the MMU will set the 'A' bit in the PTE (page table entry) whenever a page is accessed. So the kernel continuously loops through all the memory in use, checking and clearing this bit. Any page that has the bit set has been accessed since the last sweep, and any page where the bit is (still) clear has not. The latter pages are candidates for replacement. The details vary from OS to OS, but generally there's some sort of queue structure(s) where these pages are tracked, and the oldest ones are replaced.
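The sweep can be sketched roughly as follows. This is a toy model, not any particular kernel's code: pages are dictionaries, the "accessed" field stands in for the hardware A bit, and the function name sweep is made up for the example.

```python
def sweep(pages):
    """One pass over all pages: report those not accessed since the last
    sweep, and clear the accessed bit on the ones that were."""
    candidates = []
    for page in pages:
        if page["accessed"]:
            page["accessed"] = False      # clear; re-checked on the next sweep
        else:
            candidates.append(page["name"])
    return candidates


pages = [
    {"name": "A", "accessed": True},
    {"name": "B", "accessed": False},
    {"name": "C", "accessed": True},
]

first = sweep(pages)            # B was not touched before this sweep
pages[0]["accessed"] = True     # simulate the MMU setting the bit for page A
second = sweep(pages)           # B and C were not touched between sweeps
```

Page C was in use before the first sweep but not afterwards, so it only becomes a replacement candidate on the second pass, which is what gives the scheme its approximate-LRU behaviour.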

How many page tables do Intel x86-64 CPUs access to translate virtual memory?

I am trying to understand the number of tables looked-up, when translating a virtual address to a physical address. The Intel manual seems to state numerous schemes:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf
(section 4)
whereas Ulrich Drepper's paper says there are typically 4:
http://www.akkadia.org/drepper/cpumemory.pdf
(page 38)
Whereas I seem to find a lot of diagrams online only showing two:
Could somebody explain how many page tables are typically accessed on an Intel CPU, or whether it depends on OS configuration (like huge memory pages, etc.)?
Short answer
The most commonly used number of page-table levels on x86-64 systems is 4. This assumes a 64-bit OS using what Intel calls IA-32e paging mode and 4 KB pages. Fewer page tables are needed for some of the other modes (as described in the Intel documentation).
Detailed answer
Figure 4-1 from the Intel 64 and IA-32 Architectures Software Developer’s Manual shows the possible configurations. The columns to focus on are the two address-width columns and the page-size column. The reason you see so many different options is that each of these combinations changes how the page tables are used and how many are needed.
The most commonly used configuration on x86-64 systems today is the IA-32e paging mode with 4 KB pages and the details for how this works are shown in Figure 4-8.
The value in register CR3 points to the root of the paging structure. There is a 48-bit linear address (the program's virtual address) that needs to be translated to a physical address (with up to 52 bits). The page offset for a 4 KB page is 12 bits, so that leaves 36 bits in the linear address to index into the page table. The more bits that are used to index into each table structure, the larger that table would need to be. What Intel has done is divide the page table into 4 levels, and each level is accessed with 9 index bits (4 × 9 = 36).
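The 9+9+9+9+12 split can be written out as bit arithmetic. This is a minimal sketch; the function name split_linear_address is made up, and the level names in the comments (PML4, PDPT, PD, PT) follow Intel's usual terminology.

```python
INDEX_BITS = 9
INDEX_MASK = (1 << INDEX_BITS) - 1   # 0x1FF, selects one of 512 entries
OFFSET_BITS = 12                     # 4 KB page


def split_linear_address(addr):
    """Split a 48-bit linear address into four table indices and a page offset."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    # Shifts for the PML4, PDPT, PD and PT indices, from top level to bottom.
    indices = [(addr >> shift) & INDEX_MASK for shift in (39, 30, 21, 12)]
    return indices, offset


example_addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
indices, offset = split_linear_address(example_addr)
```

Each index selects one of 512 entries in its table, so with 8-byte entries every table fits exactly in one 4 KB page.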
If you are using 2 MB pages then you have 21 bits of offset into the page, so one of the tables used in the translation can be removed, while still keeping the other tables the same size (shown in Figure 4-9).
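For comparison, a sketch of the same split for a 2 MB page, where the lowest 9 index bits are absorbed into a 21-bit offset and only three tables are consulted. The function name split_2mb_address is hypothetical.

```python
INDEX_MASK = 0x1FF          # 9 index bits per level
OFFSET_BITS_2MB = 21        # 12 + 9 bits of offset for a 2 MB page


def split_2mb_address(addr):
    """Split a 48-bit linear address that maps to a 2 MB page."""
    offset = addr & ((1 << OFFSET_BITS_2MB) - 1)
    # Only three levels remain: the lowest-level page table is skipped.
    indices = [(addr >> shift) & INDEX_MASK for shift in (39, 30, 21)]
    return indices, offset


addr_2mb = (1 << 39) | (2 << 30) | (3 << 21) | 0x12345
indices_2mb, offset_2mb = split_2mb_address(addr_2mb)
```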
The other configurations follow the same pattern and you can look in the Intel manual for more detail if necessary.
I suspect that the reason you see diagrams online with only two levels is because that provides enough details to explain the overall concepts used in paging. The additional levels are simply an extension of the same concepts, but tuned for the particular address size and page table size that the architecture wants to support.
It is largely OS dependent. Intel likes to make their processors hyperconfigurable. The number of page table layers is designed to be configurable to handle different physical addresses. For 32-bit addresses (in 32-bit mode) Intel says two levels are normal. For 40-bit addresses, Intel says three levels are normal.
Larger physical addresses => More levels.
Larger pages => Fewer levels
Some non-Intel processors take the more rational approach of making the page tables pageable.

In operating system, How MMU searches for virtual page number as key in page table

1) Let's say there is a single-level page table.
2) A TLB miss happens.
3) The required page table is in main memory.
Question: Does the MMU always fetch the required page table into a set of internal registers so that a fast hardware search, like in the TLB, can be performed? I guess not, since that would be costly hardware.
4) The MMU fetches the physical page number. (I guess the entry must be stored in a format like: high n bits as the virtual page number and low m bits as the physical page frame number. Please correct me and explain if I am wrong.)
Question: I guess there has to be a key-value map with the virtual page number as the key and the physical frame number as the value. How does the MMU search for the key in the page table? If it is something like a software linear search, it would be very costly.
5) The hardware appends the offset bits to the page frame number, and finally a read occurs at the physical address.
So this question is bugging me a lot: how does the MMU perform the search for a given key (virtual page number) in the page table?
The use of registers for a page table is satisfactory if the page table is reasonably small (for example, 256 entries). Most contemporary computers, however, allow the page table to be very large (for example, 1 million entries). For these machines, the use of fast registers to implement the page table is not feasible. Rather, the page table is kept in main memory, and a page table base register (PTBR) points to the page table. Changing page tables requires changing only this one register, substantially reducing context-switch time.
The problem with this approach is the time required to access a user memory location. If we want to access location i, we must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires a memory access. It provides us with the frame number, which is combined with the page offset to produce the actual address. We can then access the desired place in memory. With this scheme, two memory accesses are needed to access a byte (one for the page-table entry, one for the byte). Thus, memory access is slowed by a factor of 2. This delay would be intolerable under most circumstances. We might as well resort to swapping!
The standard solution to this problem is to use a special, small, fast-lookup hardware cache, called a translation look-aside buffer (TLB). The TLB is associative, high-speed memory. Each entry in the TLB consists of two parts: a key (or tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often numbering between 64 and 1,024.
Source: Operating System Concepts by Silberschatz et al., page 333
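To make the quoted point concrete: the page number is used as an index into the table, so there is no search at all, and the TLB only caches recent results. Below is a minimal sketch under simplifying assumptions: 4 KiB pages, a single-level table modelled as a Python list, and a TLB modelled as a dictionary. The names translate, page_table and tlb are made up for the example.

```python
PAGE_SHIFT = 12                       # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(page_table, tlb, vaddr):
    """Translate a virtual address; return (physical address, where it was found)."""
    vpn = vaddr >> PAGE_SHIFT         # virtual page number
    offset = vaddr & PAGE_MASK
    if vpn in tlb:                    # TLB hit: no page-table access needed
        frame, source = tlb[vpn], "tlb"
    else:
        frame = page_table[vpn]       # direct index (PTBR + vpn), not a search
        if frame is None:
            raise LookupError("page fault: no mapping for this page")
        tlb[vpn] = frame              # cache the translation for next time
        source = "page table"
    return (frame << PAGE_SHIFT) | offset, source


page_table = [None, 7, 3, None]       # index = virtual page, value = frame
tlb = {}
first = translate(page_table, tlb, 0x1ABC)    # miss, then one table access
second = translate(page_table, tlb, 0x1ABC)   # same page again, served by TLB
```

The entry does not need to store the virtual page number, because the entry's position in the table already encodes it.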
