ARM Linux Page tables layout - memory-management

I have read multiple articles on this topic including below but things are still hazy to me:
http://elinux.org/Tims_Notes_on_ARM_memory_allocation
ARM Linux kernel page table
Linux kernel ARM Translation table base (TTB0 and TTB1)
ARM hardware has 4096 entries of 4 bytes each in the L1 translation table; each entry translates a 1MB region of memory. At the second level there are 256 entries of 4 bytes each, and each second-level entry translates a 4KB page.
So according to this, any virtual address has to be divided 12-8-12 to map to the above scheme.
But on the 32-bit ARM Linux side this division is 11-9-12, where the L1 translation table consists of 2048 entries of 8 bytes each. Here two 4-byte entries are clubbed together and the second-level translation tables they point to are laid out one after the other in memory, so that at the second level there are 512 entries instead of 256. Additionally, since Linux memory management expects various flags not native to ARM, we define 512 more entries for the Linux page table (one for each second-level HW page table).
Now the question is: Linux does not enforce PGD/PMD/PTE sizes (though it does enforce the page size to be 4K, so PAGE_SHIFT is set to 12), so why do we select the 11-9-12 layout (i.e. 11 bits for the PGD and 9 bits for the HW PTE)?
Is it just to make sure that the 512 HW + 512 Linux PTEs are aligned to a page boundary?
If someone could explain the logic behind this division in detail, that would be great.

As you say, in the ARM short-descriptor format each second-level page table is 1KB in size. Even with the associated shadow page table that only makes 2KB, meaning 50% of every page allocated for second-level tables would be entirely wasted.
Linux just pretends that the section size is 2MB, rather than the hardware's actual 1MB, by allocating first-level entries in pairs, so that the corresponding pair of second-level tables can be kept together in a single page, avoiding that wastage and keeping the management of page-table memory really simple.
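To see the arithmetic, here is a back-of-the-envelope sketch in C; the constants are assumptions taken from the description above, not pulled from kernel headers:
#include <assert.h>
#define ARM_L2_ENTRIES  256                   /* hardware: 256 x 4-byte entries */
#define ARM_L2_SIZE     (ARM_L2_ENTRIES * 4)  /* 1KB per hardware L2 table */
#define SHADOW_L2_SIZE  ARM_L2_SIZE           /* 1KB Linux shadow table */
#define PAGE_SIZE       4096
int main(void)
{
    /* One hardware table plus its shadow fills only half a page: 50% waste. */
    assert(ARM_L2_SIZE + SHADOW_L2_SIZE == PAGE_SIZE / 2);
    /* Pairing first-level entries gives two hardware tables plus two shadows,
       which exactly fills one 4KB page. */
    assert(2 * (ARM_L2_SIZE + SHADOW_L2_SIZE) == PAGE_SIZE);
    return 0;
}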

The question ARM Linux and dirty bits should have all the answers. Mainly, the PTE tables carry extra information to emulate bits the hardware lacks, resulting in the layout you observe.
I think a misconception here is the memory an L2 table occupies versus what it maps. You must allocate physical memory for an L2 table, and making it symmetric (4K in size) makes it the same size as all other pages. Now, this 4K page could hold four ARM MMU L2 page tables. However, we need some additional information to emulate the dirty and young (accessed) bits that the Linux generic MMU code requires. So the layout of the Linux L2 (PTE directory) is:
Linux PTE [n]
Linux PTE [n+1]
ARM PTE [n]
ARM PTE [n+1]
At the L1 level each entry is paired (n/n+1) so that it points to items 3 and 4 above. The pgtable-2level.h file has detailed comments on the layout (which should be correct for your version of Linux).
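As a rough illustration (hypothetical names and constants; the authoritative layout is the comment in arch/arm/include/asm/pgtable-2level.h), the four tables share one 4KB page, and getting from a Linux PTE to its hardware twin is a fixed 2048-byte offset:
#include <stdint.h>
#define PTRS_PER_PTE        512                                /* Linux view: 512 entries */
#define PTE_HWTABLE_OFFSET  (PTRS_PER_PTE * sizeof(uint32_t))  /* 2048 bytes */
/* Offsets within the 4KB page: 0 = Linux pt 0, 1024 = Linux pt 1,
   2048 = h/w pt 0, 3072 = h/w pt 1. */
static inline uint32_t *hw_pte(uint32_t *linux_pte)
{
    return (uint32_t *)((char *)linux_pte + PTE_HWTABLE_OFFSET);
}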
See: Tim's notes on ARM MM
Page table entry (PTE) descriptor in Linux kernel for ARM

x86 address space calculation PAE to 36 bits

I'm having a hard time understanding PAE. I know it creates a 3rd level of indirection via the PDPT, so that the address translation goes from CR3 -> PDPT (4 entries) -> PD (512 entries) -> PT (512 entries) -> PAGE (4096). But the address is still 32 bits, so how do you get 36-bit addresses from this scheme? I'd appreciate an example. How does adding another table "increase" the address space?
PAE changes nothing about 32-bit virtual addresses, only the size of physical address they're mapped to. (Which sucks a lot, nowhere near enough virtual address space to map all those physical pages at once. Linus Torvalds wrote a nice rant about PAE: https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/ originally posted on https://www.realworldtech.com/forum/?threadid=76912&curpostid=76973 / https://www.realworldtech.com/forum/?threadid=76912&curpostid=76980)
It also widens a PTE (Page Table Entry) from 4 bytes to 8 bytes, which means 2 levels aren't enough anymore; that's where the small extra level comes in, translating the top 2 bits of virtual addresses via those 4 entries.
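As an illustration, here is a minimal sketch of the resulting 2-9-9-12 split of a 32-bit virtual address under PAE (variable names are just illustrative):
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint32_t va = 0xB17F5123u;                /* an arbitrary example address */
    uint32_t pdpt_index = (va >> 30) & 0x3;   /* 2 bits  -> 4 PDPT entries */
    uint32_t pd_index   = (va >> 21) & 0x1FF; /* 9 bits  -> 512 PD entries */
    uint32_t pt_index   = (va >> 12) & 0x1FF; /* 9 bits  -> 512 PT entries */
    uint32_t offset     = va & 0xFFF;         /* 12 bits -> byte within the 4KB page */
    /* Each 8-byte entry has room for a physical frame number wider than 20 bits,
       which is where the >32-bit physical addresses come from. */
    printf("PDPT=%u PD=%u PT=%u offset=0x%03x\n", pdpt_index, pd_index, pt_index, offset);
    return 0;
}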
36 bits only happened to be the supported physical address size in the first generation of CPUs that implemented PAE, the Pentium Pro. There is no inherent 36-bit limit to PAE.
x86-64 adopted the PAE PTE format, which has room for up to 52-bit physical addresses. Current x86-64 CPUs support the same physical address size in legacy mode with PAE as they do in 64-bit mode (as reported by CPUID). That limit is a design choice that saves bits in cache tags, TLB entries, store-buffer entries, etc., and in the comparators involved with them. It's normally chosen to be more than the amount of RAM that a real system could actually use, given the commercially available DIMM sizes and the number of memory controllers even in multi-socket systems, while still leaving room for some I/O address space.
x86-64 came soon after PAE, or soon enough for desktop use to be relevant, so it's a common misconception that PAE is only 36 bits. (Because 64-bit mode is a vastly better way to address more memory, allowing a single process to use more than 2G or 3G depending on user/kernel split.)

Directory Table Base should divide evenly by 4k, but my Windows DTB / 4k leaves remainder 2

I did some experiments with memory analysis.
I have some problems.
I know that a Directory Table Base can almost always be divided evenly by 4k (4096).
But my process on Windows 10 (1909) has a DTB of 0x14695e002.
That can't be divided evenly by 4k; the remainder is 2.
Why does my Windows have that value?
The dirBase / Directory Table Base is the value of the CR3 register for the current process. As you may know, CR3 is the base register which (indirectly) points to the base of the PML4 (or PDPT) table and is used when switching between processes, which basically switches their entire virtual address space.
Base CR3
As you may have seen in the Intel manual the 4 lower bits of the CR3 should be ignored by the CPU (Format of the CR3 register with 4-Level Paging):
[Figure: Format of the CR3 register with 4-level paging]
Now if you look closely at the Intel Manual (Chapter 4.5, 4-Level Paging):
A logical processor uses 4-level paging if CR0.PG = 1, CR4.PAE = 1, and IA32_EFER.LME = 1
Respectively: Paging; Physical Address Extension; Long Mode Enable.
Use of CR3 with 4-level paging depends on whether process context identifiers (PCIDs) have been enabled by setting CR4.PCIDE.
CR4.PCIDE
CR4.PCIDE is documented in the Intel Manual (Chapter 2.5 Control Registers):
CR4.PCIDE
PCID-Enable Bit (bit 17 of CR4) — Enables process-context identifiers (PCIDs) when set. See Section 4.10.1, “Process-Context Identifiers (PCIDs)”. Can be set only in IA-32e mode (if IA32_EFER.LMA = 1).
So when CR4.PCIDE is set, the lower 12 bits (0:11) of CR3 are used as the PCID, that is, a "Process-Context Identifier" (bits 12 to M-1, where M is usually 48, are used for the physical address of the base of the PML4 table).
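So the remainder of 2 you observed is just the PCID field. A minimal sketch (assuming 4-level paging with CR4.PCIDE = 1) of splitting the raw DTB value from the question:
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint64_t cr3       = 0x14695e002ULL;   /* the DTB from the question */
    uint64_t pcid      = cr3 & 0xFFFULL;   /* bits 0-11: PCID */
    uint64_t pml4_base = cr3 & ~0xFFFULL;  /* bits 12 and up: PML4 physical base */
    printf("PCID      = %llu\n", (unsigned long long)pcid);        /* prints 2 */
    printf("PML4 base = 0x%llx\n", (unsigned long long)pml4_base); /* 0x14695e000 */
    return 0;
}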
PCIDs
PCIDs are documented in the Intel Manual (Chapter 4.10.1; Process-Context Identifiers (PCIDs)):
Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces. The processor may retain cached information when software switches to a different linear address space with a different PCID.
And a little bit further in the same chapter:
When a logical processor creates entries in the TLBs [...] and paging-structure caches [...], it associates those entries with the current PCID.
So basically PCIDs (as far as I understand them) are a way to selectively control how the TLB and paging structure caches are preserved or flushed when a context switch happens.
Some of the instructions that operate on cacheability control (such as CLFLUSH, CLFLUSHOPT, CLWB, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint) will check the PCID, and either flush everything that concerns a particular PCID or flush only part of the cache (such as the TLB) while keeping everything related to a given PCID.
For example, the INVLPG instruction:
The INVLPG instruction normally flushes TLB entries only for the specified page; however, in some cases, it may flush more entries, even the entire TLB. The instruction invalidates TLB entries associated with the current PCID and may or may not do so for TLB entries associated with other PCIDs.
The INVPCID specifically uses the PCIDs:
Invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on process-context identifier (PCID)
Why it is always 2 on Windows (as far as I can see, it's always 2 for every process in the system), I don't know.

Virtually indexed physically tagged cache Synonym

I am not able to entirely grasp the concept of synonyms or aliasing in VIPT caches.
Consider the address split shown in the diagram (40-bit virtual address, 36-bit physical address, cache index bits 5-13).
Here, suppose we have two pages with different VAs mapped to the same physical address (i.e. the same frame number).
The page-number part of the VA (bits 13-39), which differs between the two, gets translated to the PFN of the PA (bits 12-35), and the PFN is the same for both VAs since they map to the same physical frame.
Now, the page-offset part (bits 0-13) of both VAs is the same, since the data they want to access within that frame is the same.
As the page-offset part of both VAs is the same, bits 5-13 are also the same, so the index (set number) is the same, and hence there should be no aliasing, as only a single set/index maps to a given physical frame.
How is bit 12, as shown in the diagram, responsible for aliasing? I am not able to understand that.
It would be great if someone could give an example with the help of addresses.
Note: this diagram has a minor error that doesn't affect the question: 36 - 12 = 24-bit tags for 36-bit physical addresses, not 28. MIPS64 R4x00 CPUs do in fact have 40-bit virtual, 36-bit physical addresses, and 24-bit tags, according to chapters 4 and 11 of the manual.
This diagram is from http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html which does label it as being for MIPS R4x00.
The page offset is bits 0-11, not 0-13. Look at your bottom diagram: the page offset is the low 12 bits, so you have 4k pages (like x86 and other common architectures).
If any of the index bits come from above the page offset, VIPT no longer behaves like a PIPT with free translation for the index bits. That's the case here.
A process can have the same physical page (frame) mapped to 2 different virtual pages.
Your claim that "the pageno part of VA (bits 13-39) which are different gets translated to PFN of PA (bits 12-35) and the PFN remains same for both the VA's" is totally bogus. Translation can change bit #12. So one of the index bits really is virtual and not also physical, so two entries for the same physical line can go in different sets.
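Here is a worked example with made-up addresses (cache geometry taken from the diagram: 32-byte lines, 9 index bits = VA bits 5-13, 4K pages). Suppose the OS maps the two virtual pages 0x13000 and 0x24000 to the same physical frame; VA bit 12 is 1 in one mapping and 0 in the other, so the same physical line indexes into two different sets:
#include <stdint.h>
#include <stdio.h>
static uint32_t cache_index(uint32_t addr)
{
    return (addr >> 5) & 0x1FF;          /* index = bits 5-13 (9 bits, 512 sets) */
}
int main(void)
{
    /* Same page offset (0x040), hence the same byte of the same physical frame,
       but VA bit 12 differs between the two mappings. */
    uint32_t va1 = 0x00013000u + 0x40;   /* bit 12 = 1 */
    uint32_t va2 = 0x00024000u + 0x40;   /* bit 12 = 0 */
    printf("set(va1) = %u\n", cache_index(va1));   /* lands in one set */
    printf("set(va2) = %u\n", cache_index(va2));   /* lands in a different set */
    return 0;
}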
I think my main confusion is regarding the page offset range. Is it the same for both PA and VA (that is, 0-11), or is it 0-12 for VA and 0-11 for PA? Will they always be the same?
It's always the same for PA and VA. The page offset isn't marked on the VA part of your diagram, only the range of bits used as the index.
It wouldn't make sense for it to be any different: virtual and physical memory are both byte-addressable (or word-addressable). And of course a page frame (physical page) is the same size as a virtual page. Right or left shifting an address during translation from virtual to physical would make no sense.
As discussed in comments:
I did eventually find http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html (which includes the diagram in the question!). It says the same thing: physical tagging does solve the cache homonym problem as an alternative to flushing on context switch.
But not the synonym problem. For that, you can have the OS ensure that bit 12 of every VA matches bit 12 of the PA it maps to. This is called page coloring.
Page coloring would also solve the homonym problem without the hardware doing overlapping tag bits, because it gives 1 more bit that's the same between physical and virtual address. phys idx = virt idx. (But then the HW would be relying on software to be correct, if it wanted to depend on this invariant.)
Another reason for having the tag overlap the index is write-back during eviction:
Outer caches are almost always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out to the memory hierarchy.
A write-back cache needs to be able to evict dirty lines (send them to L2 or to physical RAM) long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere. See: How does the VIPT to PIPT conversion work on L1->L2 eviction
Having the tag include all the physical address bits above the page offset solves this problem: given the page-offset index bits and the tag, you can construct the full physical address.
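A sketch of that reconstruction, with the same assumed geometry (24-bit tag = PA bits 12-35, 9-bit index = VA bits 5-13, 32-byte lines); the low 7 index bits sit inside the page offset and are therefore already physical, and the tag supplies everything from bit 12 up:
#include <stdint.h>
static uint64_t reconstruct_pa(uint32_t tag, uint32_t index, uint32_t line_off)
{
    uint64_t pa = (uint64_t)tag << 12;    /* PA bits 12-35 from the tag */
    pa |= (uint64_t)(index & 0x7F) << 5;  /* PA bits 5-11 (inside the page offset) */
    pa |= line_off & 0x1F;                /* PA bits 0-4 (byte within the line) */
    return pa;
}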
(Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, there is no write-back; eviction = drop.)

Does paging let us use physical memory that is larger than what can be addressed by the CPU’s address pointer length?

I was reading the dinosaur book on Operating System about memory management. I assume this is one of the best books but there's something about paging written in the book which I don't get.
The book says, "A 32-bit CPU uses 32-bit addresses, meaning that a given process space can only be 2^32 bytes (4 GB). Therefore, paging lets us use physical memory that is larger than what can be addressed by the CPU's address pointer length."
I don't quite get this part, because if the CPU can only refer to 2^32 different physical addresses, then if there were 2^32 + 1 physical addresses, the last address would not be reachable by the CPU. So how can paging help with this?
Also, earlier the book says "Frequently, on a 32-bit CPU , each page-table entry is 4 bytes long, but that size can vary as well. A 32-bit entry can point to one of 2^32 physical page frames. If frame size is 4 KB (2^12 ), then a system with 4-byte entries can address 2^44 bytes (or 16 TB ) of physical memory."
I don't see how that is even possible in ideal/theoretical situations, because as I understand it, part of the virtual address refers to an entry of the page table while the other part refers to the offset of a particular byte within that page. So in the situation put forward by the book, even if the CPU could point to 2^32 different page entries, it wouldn't be able to read any particular byte within a page, because it has no bits left to specify the offset.
Maybe I've misunderstood the book or there is some part that I missed out. I much appreciate your help! Thanks a lot!
It sounds like you need to burn your book. It's useless.
"[P]aging lets us use physical memory that is larger than what can be addressed by the CPU’s address pointer length" is complete nonsense (unless the book is assigning two different meanings to the term "paging," in which it is still useless).
Let's start with logical addressing. A logical address is composed of a page selector and an offset into the page. Some number (P) of bits will be assigned to the page selector and the remainder will be assigned to the offset. If pages are 2^9 bytes, there are 23 bits in the page selector and 9 bits for the byte offset within the page.
Note that the 9/23 split is an arbitrary pick on my part. Most systems these days use larger pages, but these values have been used in the past.
The 23 bits in the page selector are indices into the process page table.
The size of an entry in the page table is going to be a power of 2 (and I have never seen one smaller than 4 bytes). For our purposes let's say that each entry is 8 bytes long.
The bits in the page table entry are divided between those that index physical page frames and control bits. Let's make the arbitrary choice that 32 bits index page frames and 32 bits are used for control.
That means the system can theoretically MANAGE 2^32 pages that are 2^9 bytes large or a total of 2^41 bytes. If we were to increase the page size from 2^9 to 2^20, the system could theoretically MANAGE 2^52 (32+20) bytes of memory.
Note that each process can still only ACCESS 2^32 bytes. But in my 9-bit page system, 2^9 processes could each access 2^32 bytes simultaneously on a system with 2^41 physical bytes of memory (ignoring the need for a shared system address space in this gross oversimplification).
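Spelling out that arithmetic (same arbitrary 9-bit pages and 32 frame-index bits as above; this corresponds to no real machine):
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint64_t page_size     = 1ULL << 9;                  /* 2^9-byte pages */
    uint64_t frame_entries = 1ULL << 32;                 /* 32 frame-index bits in each PTE */
    uint64_t managed       = frame_entries * page_size;  /* 2^41 manageable bytes */
    uint64_t per_process   = 1ULL << 32;                 /* 2^32 accessible bytes per process */
    printf("manageable physical memory: %llu bytes\n", (unsigned long long)managed);
    printf("processes' worth that fits: %llu\n", (unsigned long long)(managed / per_process)); /* 512 */
    return 0;
}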
Note that if I change my page table entry to 32 bits and assign 9 of those bits to control and 23 to page-frame selection, the system can only MANAGE 2^32 bytes of memory (and that was more common than managing more than 2^32 bytes).
You quote: "Frequently, on a 32-bit CPU , each page-table entry is 4 bytes long, but that size can vary as well. A 32-bit entry can point to one of 2^32 physical page frames. If frame size is 4 KB (2^12 ), then a system with 4-byte entries can address 2^44 bytes (or 16 TB ) of physical memory."
This is theoretical BS. A system that used all 32 bits of the page table entry as an index to page frames could not function. There would have to be some control bits in the page table entry.
The quotes you are taking from this book are highly misleading. Few (any?) 32-bit processors could even access 2^32 bytes of memory due to address line limitations.
While it is possible that the use of logical pages could allow a processor to manage more memory than the logical address size suggests, that was not the purpose of managing memory in pages.
The purpose of paging—which in its normal and customary usage refers to the movement of virtual memory pages between physical page frames and secondary storage—is to allow processes to access more virtual memory than there was physical memory on the system.
There is an additional system of memory management that is (thankfully) dying out: segments. Segments also provided a means for systems to manage more physical memory than the logical address space would allow.

page table walk in armv7 linux by S/W leads to which version of page table ARM PTE or Linux PTE

My question is about the handle_mm_fault function, or any software page-table walk:
pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
Will the final computed pte be the ARM or LINUX version?
ARMv7 supports 2 levels of page tables. At the first level there are 4096 entries of 4 bytes each, each holding the address of a second-level table. The second level has 256 entries of 4 bytes each. Linux has tweaked the page table to have 2048 first-level entries of 8 bytes each, in other words two pointers to second-level page tables, giving 512 second-level entries placed contiguously. The Linux PTEs are stored in the same page, below (at lower offsets than) these 512 ARM PTEs.
So I understand that a software page-table walk leads to the ARM PTE only, but is that not correct, since Linux always operates on Linux PTEs?
Please tell me where I am wrong.
I got the answer.
I understood that all these macros lead to the Linux PTE only, since the L2-level page table occupies a 4KB page: Linux pte0 starts at offset 0 and Linux pte1 starts at offset 1024. When calculating pte_index, the lower 12 bits of the PMD value are masked off, so the PMD always points to the start of the page, and that is where the Linux PTEs are stored.
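As a rough sketch of that index computation (hand-written approximations of the kernel macros, not the exact source; see arch/arm/include/asm/pgtable.h and pgtable-2level.h in your tree):
#define PAGE_SHIFT    12
#define PAGE_MASK     (~((1UL << PAGE_SHIFT) - 1))
#define PTRS_PER_PTE  512
/* Bits 12..20 of the virtual address select one of the 512 Linux PTEs. */
static inline unsigned long pte_index(unsigned long addr)
{
    return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}
/* Masking off the low 12 bits of the pmd value lands on the start of the
   4KB page, i.e. on Linux pte0, so the software walk yields Linux PTEs. */
static inline unsigned long pmd_page_start(unsigned long pmd_val)
{
    return pmd_val & PAGE_MASK;
}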
