Does a page table entry only contain metadata? - memory-management

I'm trying to understand how the OS swaps between disk and RAM when a page fault occurs. For instance, assume the page table of a process is full and a swap needs to happen.
Does the frame that the page table entry points to get written to disk, and then get overwritten by the newly requested data? Or does the page itself contain the frame data in this case?
Additionally, if two virtual addresses can map to the same physical address (assuming it's not because of shared memory), does the data written in the frame that their page points to get written to disk?
My confusion comes from the fact that the book I'm reading (and many resources online) mention that the 'page' gets written to disk, but from what I understand the page only contains metadata about the frame and the address, not the memory data itself. So, how does it work, exactly?

Does a page table entry only contain metadata?
Yes.
In general there are virtual pages (what software sees), physical pages (actual pages of RAM that hardware sees), and metadata (page table entries) that describe the relationship between virtual pages and physical pages.
Assume that there are 90 virtual pages (of data being used by programs) and 100 physical pages (of actual RAM), where:
virtual page #0 = physical page #38
virtual page #1 = physical page #22
virtual page #2 = physical page #41
...
virtual page #89 = physical page #12
Note that this mapping is (a crudely simplified version of) metadata stored in page table entries.
Now assume software allocates a new page (so that there would be 91 virtual pages being used by programs); and the OS decides there aren't enough free physical pages of RAM (because the OS needs to ensure there's a little free physical memory for its own use) so it decides to send a page to swap space. The result might be:
virtual page #0 = physical page #38
virtual page #1 = NOT PRESENT (sent to swap space)
virtual page #2 = physical page #41
...
virtual page #89 = physical page #12
virtual page #90 = physical page #73
Now assume a program tries to use data in virtual page #1. The CPU can't figure out where the data is (because the metadata in the page table says the page isn't present); so it informs the OS. The OS determines what happened (if it's a software bug or ...) and decides to store a different virtual page's data in swap space so it can re-use that physical page to hold the data for virtual page #1. The result might be:
virtual page #0 = physical page #38
virtual page #1 = physical page #41
virtual page #2 = NOT PRESENT (sent to swap space)
...
virtual page #89 = physical page #12
virtual page #90 = physical page #73
After this the software that wanted data from virtual page #1 can continue (because the CPU can find the data now that the OS changed the metadata).
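The walk-through above can be sketched in C. This is only a toy illustration, not how any real OS lays out its entries: `pte_t`, its fields, `NUM_VPAGES` and `swap_in` are hypothetical names made up for this example. The point it tries to show is that the entry stores a flag and a number, while the page's data lives either in the physical page or in swap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_VPAGES 91   /* hypothetical: virtual pages #0..#90 from the example */

/* A page table entry is metadata only: it never holds the page's data. */
typedef struct {
    bool present;       /* true: 'where' is a physical page number */
    unsigned where;     /* physical page if present, else a swap slot (assumed) */
} pte_t;

/* Evict 'victim' to swap slot 'slot' and reuse its physical page for 'wanted'.
 * Copying the page contents to and from disk is omitted; only the
 * metadata update is shown. */
static void swap_in(pte_t *table, size_t wanted, size_t victim, unsigned slot)
{
    assert(!table[wanted].present && table[victim].present);
    unsigned frame = table[victim].where;

    table[victim].present = false;   /* victim's data now lives in swap */
    table[victim].where = slot;

    table[wanted].present = true;    /* wanted page takes over the frame */
    table[wanted].where = frame;
}
```

Calling `swap_in(table, 1, 2, slot)` on a table where virtual page #2 sits in physical page #41 reproduces the last listing: page #1 ends up in physical page #41 and page #2 is marked not present.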

Related

how a virtual address is mapped to address on the swap partition in paging operation

I'm wondering if anyone could help me understand how a virtual address is mapped to its address on the backing store, which is used to hold moved-out pages of all user processes.
Is it a static mapping or a hash algorithm? If it's static, where is such a mapping kept? It seems it can't be in the TLB or page table since, according to https://en.wikipedia.org/wiki/Page_table, the PTE will be removed from both the TLB and the page table when a page is moved out. A description of the algorithm and the C structs containing such info would be helpful.
Whether it's a static mapping or a hash algorithm, how is it guaranteed that no two processes map their addresses to the same location on the swap partition, since the virtual address space of each process is so big (2^64) and the swap space is so small?
So:
during page-in, how does the OS know where to find the address (corresponding to the virtual address accessed by the user process) on the swap partition to move in?
when a physical page needs to be paged out, how does the OS know where to put it on the swap partition?
For the first part of your question: it is actually hardware dependent, but the generic way is to keep a reference to the swap block containing the swapped-out page (depending on the implementation of the swap subsystem, it could be a pointer, a block number or an offset into a table) in its corresponding page table entry.
EDIT: The TLB is a fast associative cache that helps to do the virtual to physical page mapping very quickly. When a page is swapped out, its entry in the TLB could be replaced by a newly active page. But the entry in the page table cannot be replaced because page tables are not associative memory. A page table remains persistent in memory for the whole duration of the process and no entry can be removed or replaced (by another virtual page). Entries in page tables can only be mapped or unmapped. When they are unmapped (because of swapping or freeing), the content of the entry can either hold a reference to the swap block or just an invalid value.
For the second part of your question : The system kernel maintains a list of free blocks in the swap partition. Whenever it needs to evict a RAM page, it allocates a free block and then the block reference is returned so that it can be inserted in the PTE. When the page comes back to RAM, the disk block is freed so that it could be used by other pages.
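A minimal sketch of the two parts of this answer, under assumptions of my own: the swap area is tracked with a simple in-use array, and a non-present PTE keeps the slot number in its upper bits. `SWAP_SLOTS`, `PTE_PRESENT`, `swap_alloc`, `swap_free`, `pte_make_swapped` and `pte_swap_slot` are hypothetical names, not taken from any real kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SWAP_SLOTS  8u      /* hypothetical size of the swap area, in pages */
#define PTE_PRESENT 0x1u    /* assumed layout: bit 0 = present */

static bool slot_used[SWAP_SLOTS];   /* kernel-wide map of swap blocks in use */

/* Return a free swap slot and mark it used, or -1 if swap is full. */
static int swap_alloc(void)
{
    for (unsigned i = 0; i < SWAP_SLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = true;
            return (int)i;
        }
    }
    return -1;
}

/* Release a slot once its page has been read back into RAM. */
static void swap_free(unsigned slot)
{
    assert(slot < SWAP_SLOTS && slot_used[slot]);
    slot_used[slot] = false;
}

/* Encode a swapped-out PTE: present bit clear, slot number in the upper bits. */
static uint32_t pte_make_swapped(unsigned slot)
{
    return (uint32_t)slot << 1;
}

/* Recover the slot number from a non-present PTE during page-in. */
static unsigned pte_swap_slot(uint32_t pte)
{
    assert((pte & PTE_PRESENT) == 0);
    return pte >> 1;
}
```

Because there is one allocator for the whole system, two processes can never be handed the same slot, no matter how large their virtual address spaces are. A real kernel would also need a way to tell "swapped to slot 0" apart from "never mapped", which this sketch ignores.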
During page-in, how does the OS know where to find the page (corresponding to the virtual address accessed by the user process) on the swap device to move in?
That can actually be a fairly complicated process. The operating system has to maintain a table of where the process's pages are mapped to. This can be complicated because pages can be mapped to multiple devices and even multiple files on the same device. Some systems use the executable file for paging.
When a physical page needs to be paged out, after the virtual address for a physical page is looked up in the TLB, how does the OS know where to put it on the swap device?
On a rationally designed operating system, the secondary storage is allocated (or determined) when the virtual page is mapped to the process. That location remains fixed for the duration of the program being run.

Setting up memory pagination

Next, the loader creates a basic page table. This page table maps the 64 MB at the base of virtual memory (starting at virtual address
0) directly to the identical physical addresses. It also maps the same
physical memory starting at virtual address LOADER_PHYS_BASE, which
defaults to 0xc0000000 (3 GB). The Pintos kernel only wants the latter
mapping, but there's a chicken-and-egg problem if we don't include the
former: our current virtual address is roughly 0x20000, the location
where the loader put us, and we can't jump to 0xc0020000 until we turn
on the page table, but if we turn on the page table without jumping
there, then we've just pulled the rug out from under ourselves.
The quote comes from: https://web.stanford.edu/~ouster/cgi-bin/cs140-winter16/pintos/pintos_6.html
I don't think I fully understand the passage above.
My issues:
Let's assume that the kernel was loaded at physical address 0x20000 by the bootloader. From what I understand, two different virtual addresses are mapped to the same place, the kernel. The first is the direct mapping: ~0x20000 -> ~0x20000. The second is ~0xc0020000 -> ~0x20000. But why are two mappings needed?
I cannot see why the page table (and page directory) must be identity-mapped.
Please explain.
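As an illustration of the two mappings the quoted passage describes, here is a small sketch that models the loader's initial page table as a function. `loader_translate` and `MAPPED_BYTES` are hypothetical names; only `LOADER_PHYS_BASE`, its default value and the 64 MB size come from the quote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOADER_PHYS_BASE 0xc0000000u          /* default given in the quoted text */
#define MAPPED_BYTES     (64u * 1024 * 1024)  /* the 64 MB the loader maps */

/* Translate a virtual address under the loader's initial page table.
 * Returns false for an address that neither mapping covers. */
static bool loader_translate(uint32_t vaddr, uint32_t *paddr)
{
    assert(paddr != NULL);
    if (vaddr < MAPPED_BYTES) {                     /* identity mapping at 0 */
        *paddr = vaddr;
        return true;
    }
    if (vaddr >= LOADER_PHYS_BASE &&
        vaddr - LOADER_PHYS_BASE < MAPPED_BYTES) {  /* alias at 3 GB */
        *paddr = vaddr - LOADER_PHYS_BASE;
        return true;
    }
    return false;
}
```

With both branches present, 0x20000 and 0xc0020000 resolve to the same physical address. Without the first branch, the instruction that follows the one enabling paging (still being fetched from around 0x20000) would have no translation at all, which is the chicken-and-egg problem the passage mentions.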

linux kernel page table update

In Linux x86 paging:
each process has its own page directory.
page table walking starts with the page directory pointed to by CR3.
every process shares the kernel page directory content.
Assuming these three statements are correct, let's say some process enters kernel
mode and updates its kernel page directory content (address mapping, access
rights, etc.)
Question. since kernel address spaces is globally shared among processes,
this update has to be synchronized with other process's page directory,
right?
how can this be managed?
I don't know about Linux, so I'll answer for Windows. Some of the kernel space is 'global', which is a flag set in the PTE to indicate it is used by more than one process. The INVPCID instruction can be configured in the register operand to include or exclude these entries in a TLB invalidate. These page table entries are shared between the processes and all appear at the same place in the page table for each process. This way, only the single PTE needs to be updated and it doesn't need to synchronise other PTEs of other processes as they all share a single PTE at a physical address.
http://www.cs.miami.edu/home/burt/journal/NT/memory.html
Some kernel memory is not visible to all processes and is private to each process (doesn't change the fact it is still ring 0). This, on a 32 bit Windows system would be 0xC0000000–0xC0200000 which contains all the user space PTEs and PDEs where 0xC0000000 is the PTE_BASE which allows for the equation
#define MiGetPteAddress(x) ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + (ULONG_PTR)MmPteBase))
#define MiAddressToPte(x) MiGetPteAddress(x)
to work elegantly for converting the faulting virtual address in cr2 to the address of the PTE. This is private to each process as each process has the same base PTE allocation base address; if it were visible to all processes it would quickly take up virtual memory as each set of page tables would have to be allocated sequentially. It doesn't need to be visible to all processes because a process has no interest in the page table entries of another process. A page fault is always handled in the context of the current process, and 0xC0000000–0xC0200000 means something different in each process context.
The kernel space 0xC0200000–0xC0400000 for allocation of kernel PTEs (for kernel addresses) would however be global and shared by all processes, except for the section within it representing 0xC0000000–0xC0200000, which by my calculation will be 0xC0300000–0xC0300800, which is the user-mode side of the PDEs as PDE_BASE = 0xC0300000–0xC0300FFF.
It is however impossible to split up the user PDE and kernel PDE section such that the former is private and the latter is global (i.e. make 0xC0300000–0xC0300800 private (point to different physical addresses) and 0xC0300800–0xC0300FFF point to the same physical address for each process) because the whole PDE region (0xC0300000–0xC0300FFF) will lie on the same physical frame and constitutes a single frame pointed to by cr3, and the cr3 is different for each process, which means that the whole PDE region (all PDEs) would have to be private per process (duplicated and installed per process). If a kernel page table page (a page containing a kernel page table) were paged out and in to a new physical location then the PDEs would all have to be synchronised, because all processes have copies at different cr3 physical addresses and not the same physical PDE. I'm not sure how it does this (efficiently) at the moment, so it would be wise to impose the restriction of not allowing the kernel page tables to be paged out and to keep them in non-paged pool; this way the kernel PDEs will remain constant across all CR3 pages. On 64 bit, the restriction could be imposed that kernel PDPTs can't be paged out. On 32 bit Windows, a process is started with a physical CR3 page with a PDE at offset 1100000000(base 2)*4 bytes pointing to itself, which is hard-written in, probably by briefly turning off paging in cr0 (because the write won't succeed without the recursive entry that needs to be written being there, creating a paradox). Notice that the PD entry for itself is the page table that covers the range 0xC0000000–0xC0400000, i.e. it points to 1023 page tables and 1 page directory (itself) (2^10 entries) and hence allows the PTEs to be modified by their virtual address.
The reason why the CR3 page is at 0xC0300000 is because the address has the same page directory and page table indexes 1100000000 and 1100000000 so it loops back on itself twice, therefore yielding the CR3 page and you can modify the PDEs by address (there are other addresses that are special like this e.g. 0xE0380000). After it is set up, the appropriate kernel mappings are made. On 64 bit Windows it would be similar where a process is set up with a single PML4 table page which points to itself and this way any PML4E, PDPTE, PDE or PTE can be filled in and accessed due to the variable amount of loopbacks. On 64 bit Windows, when a process is terminated, all the physical pages of the process get moved to the free list which would include all user physical PDPT pages, PD pages, PT pages and the PML4/CR3 page. The kernel ones would not be marked for the free list.
In general, if you know which entry in the PML4 is the recursive entry to the physical PML4 page, you can work out the virtual address of the PTE structure that services (is used to translate) a particular virtual address range and a particular virtual address in that range. You prepend the offset (10 bits for 32 bit; 9 bits for 64 bit) of the self-referencing entry to the start of the virtual address whose servicing PTE virtual address you want to find (which is what the addition of 0xC0000000 is in the 32 bit equation earlier), remove the last 12 bits, and then make up the offset in the PT, now at the end of the virtual address, to 12 bits by multiplying it by 8 (or 4) (hence the right shift by 12 and the left shift by 3 (or 2 for 32 bit entries)). 1 loopback takes away 1 layer of indirection and you get the virtual address of the PTE. 2 loopbacks will leave you with the virtual address of the PDE that's used to translate that particular virtual address, and so on. PTE_BASE on 32 bit Windows is the offset 1100000000 left shifted to make 32 bits and PDE_BASE is the offset 11000000001100000000 left shifted to make 32 bits. It is used in the macro, and any virtual address with this prefix will by definition be part of a PTE or a PDE respectively. Windows chooses the offset 1100000000 for the page table hierarchy but it could be any one of the 2^10 combinations.
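The 32 bit arithmetic described here can be checked with a few lines of C. This is a sketch of the address calculation only; `pte_address` and `pde_address` are hypothetical helper names, and the two base constants are the 32 bit values stated in this answer.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_BASE 0xC0000000u   /* from the text: recursive index 0x300 << 22 */
#define PDE_BASE 0xC0300000u   /* from the text: the PTE address of PTE_BASE itself */

/* Virtual address of the PTE that translates 'va' (one loopback).
 * Same arithmetic as MiGetPteAddress for 4-byte entries. */
static uint32_t pte_address(uint32_t va)
{
    return PTE_BASE + ((va >> 12) << 2);
}

/* Virtual address of the PDE that translates 'va' (two loopbacks). */
static uint32_t pde_address(uint32_t va)
{
    return PDE_BASE + ((va >> 22) << 2);
}
```

For example, `pte_address(0xC0000000)` gives 0xC0300000 and `pte_address(0xC0200000)` gives 0xC0300800, which matches the 0xC0300000–0xC0300800 range worked out earlier. `pde_address(0xC0000000)` and `pte_address(PDE_BASE)` both give 0xC0300C00, the address of the self-referencing entry.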
KAISER, or KPTI, designed to mitigate meltdown, most likely has 2 cr3s for each process. Upon trapping to the kernel, the restricted cr3 for user mode which would contain a single kernel PML4E—enough for a preliminary interrupt dispatch routine function to be accessible, which performs the swap—would be replaced with the full cr3 containing all kernel PML4Es.
As for physical memory on windows, see here: https://superuser.com/a/1549970/933117
Question. since kernel address spaces is globally shared among processes, this update has to be synchronized with other process's page directory, right?
how can this be managed?
First; understand that paging is usually 2 or more levels of tables. For example (for 80x86), for the oldest "plain 32-bit paging" there are page tables and page directories; and for current long mode there's page map level 4, page directory pointer table, page directory and page table. CR3 points to the highest level table and that must be different for each virtual address space ("process"). For the second highest level table, a single second highest level table can be put into all highest level tables, and if you do that any changes to the second highest level table will automatically change every virtual address space.
This means that (for 80x86), for the oldest "plain 32-bit paging" you can put the same "kernel page table" into all virtual address spaces (all page directories) and when you add/remove pages from that page table it will automatically affect all virtual address spaces; and for current long mode you can put the same page directory pointer table in all virtual address spaces (all page map level 4 tables) and when you add/remove page directories, page tables, or pages, it will automatically affect all virtual address spaces.
This means that you only really need some way to change second highest level page tables (or, some way to change all highest level page table entries). There are multiple ways to do this. The easiest is pre-allocation. For example, if you say "kernel space will always be N MiB" you can pre-allocate all the second highest level tables you'd need for "N MiB" during boot and never change them (e.g. for long mode, you could say that kernel space will be 512 GiB, pre-allocate a single "kernel page directory pointer table", and put that into every page map level 4 when a virtual address space is being created, and then rely on all other changes (to page directories, page tables, etc) automatically affecting the kernel space for all virtual address spaces). I believe this is the method Linux uses (partly because Linux uses the silly "map all RAM into kernel space" security disaster at boot).
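A toy model of the pre-allocation idea for plain 32-bit paging, with ordinary C pointers standing in for physical addresses. Everything here (`page_table_t`, `page_dir_t`, `kernel_table`, `KERNEL_PDE`, the helper functions) is a hypothetical illustration and not Linux code; it assumes a single kernel page table for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES    1024   /* entries per table in plain 32-bit paging */
#define KERNEL_PDE 768    /* assumed: kernel space starts at 3 GiB */

typedef struct { uint32_t pte[ENTRIES]; } page_table_t;
typedef struct { page_table_t *pde[ENTRIES]; } page_dir_t;

static page_table_t kernel_table;   /* one table, allocated once at boot */

/* Every new address space gets the same kernel page table installed. */
static void address_space_init(page_dir_t *pd)
{
    for (int i = 0; i < ENTRIES; i++)
        pd->pde[i] = NULL;
    pd->pde[KERNEL_PDE] = &kernel_table;
}

/* Read a kernel PTE by walking through a given page directory. */
static uint32_t kernel_pte(const page_dir_t *pd, unsigned index)
{
    assert(index < ENTRIES);
    return pd->pde[KERNEL_PDE]->pte[index];
}
```

After a single write to `kernel_table.pte[5]`, `kernel_pte` returns the new value through every page directory, so no per-process synchronisation of the tables themselves is needed. TLB invalidation is a separate problem.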
However, this is just the table changes alone. There are 2 other concerns.
The first "other concern" is the CPU's translation look-aside buffers (TLBs); which need to be flushed when (virtual address to physical address) translation/s change. Most operating systems use a combination of "lazy TLB shootdown" (where a CPU using wrong information from TLB causes a page fault and page fault handler invalidates and returns so the software that caused the page fault can continue with the new/correct translation without knowing anything happened) and "multi-CPU TLB shootdown" (where you send an inter-processor interrupt to other CPUs and that interrupt handler invalidates the TLB entries).
The second "other concern" is making sure CPUs don't try to change the same thing at the same time. This typically ends up being a problem solved at a higher level. For example, if you acquire a lock for a certain data structure (before changing something in that data structure) and realize you need to allocate/free pages for that data structure (while you're trying to make the changes); then the code that modifies paging tables doesn't need to care about different CPUs changing the page tables at the same time because it knows that something at a higher level (the data structure's lock) already ensures that can't happen.
When the kernel changes page table entries, these updates must be made atomically:
In the 64bit kernel this can be conveniently done using 64bit memory operations, while i386 needs to use CMPXCHG8B.
(Source)
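To illustrate what "atomically" means for a 64-bit entry, here is a sketch using C11 `<stdatomic.h>`. `pte64_t`, `pte_set` and `pte_replace` are hypothetical names, and this is not kernel code; which instruction the compiler actually emits (a plain 64-bit store, or a compare-and-exchange on 32-bit x86) depends on the compiler and target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef _Atomic uint64_t pte64_t;   /* hypothetical 64-bit PTE type */

/* Publish a new entry in one indivisible store, so a concurrent page
 * walk sees either the old value or the new one, never a mix of halves. */
static void pte_set(pte64_t *pte, uint64_t value)
{
    assert(pte != NULL);
    atomic_store(pte, value);
}

/* Replace the entry only if it still holds 'expected'. Returns false
 * if something else changed it first. */
static bool pte_replace(pte64_t *pte, uint64_t expected, uint64_t value)
{
    assert(pte != NULL);
    return atomic_compare_exchange_strong(pte, &expected, value);
}
```

The compare-and-exchange form is useful when the hardware may set accessed or dirty bits in the entry behind the software's back, because a stale `expected` makes the update fail instead of silently overwriting those bits.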

OS and Hardware role during a LD instruction

When loading the contents of a virtual address into a particular register, what are some general sequence of events that need to happen in the hardware and operating system as part of the process?
For example,
LD 0xffe4ca32, R1
The address used for this is the virtual address right?
And it would need to go through some address translation first to get a physical address.
My first question is,
When this instruction executes, how is this instruction handled by the Hardware and Operating System?
And my second question is,
Is the "value" of that virtual address, 0xffe4ca32, the contents of its mapped physical address or is it the physical address itself?
I'm just not clear what is being loaded into R1.
Here:
Let's assume x86. First, the CPU asks the MMU (memory management unit) to translate the address. The MMU first checks the TLB (translation look-aside buffer), where recent translations from virtual to physical are stored. If the translation is there, the physical address is returned. Otherwise, the MMU looks up the address in the page table. If the page is either a supervisor-only page or a page marked as not present in memory, the CPU raises a protection fault or a page fault, respectively (on x86 both arrive as a page-fault exception and are told apart by its error code). For the protection fault, the OS will usually terminate the responsible process, however it does that. For a page fault, the OS checks its own paging structures to see if that page has been paged out, or if it just doesn't exist. If it has been paged out, it is read in to some page somewhere in memory, and the virtual address is remapped to that new place. If space cannot be found, another page will be put on disk to make room (when this happens a lot, it is called thrashing). If it has not been paged out, the OS will most likely kill the process, as it is trying to reference a non-existent page.
The value loaded is the contents of the mapped physical address. Virtual memory pointers behave just like physical memory pointers from the perspective of user space. In kernel space, there are some complications as physical memory access is needed (this is usually achieved through something called identity paging, where the first few hundred pages are mapped directly to their corresponding physical memory).
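The lookup order described in this answer (TLB first, then the page table, then a fault for the OS to handle) can be sketched as a small simulation. All names and sizes here (`translate`, `xlate_t`, `NUM_PAGES`, `TLB_SIZE`, a direct-mapped TLB) are assumptions made for the example; real hardware differs in many details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  16u   /* hypothetical tiny address space */
#define TLB_SIZE   4u    /* hypothetical direct-mapped TLB */

typedef enum { XLATE_TLB_HIT, XLATE_WALK, XLATE_PAGE_FAULT } xlate_t;

typedef struct { bool present; uint32_t frame; } pte_t;
typedef struct { bool valid; uint32_t vpn; uint32_t frame; } tlb_entry_t;

static pte_t page_table[NUM_PAGES];
static tlb_entry_t tlb[TLB_SIZE];

/* Translate 'vaddr'; on success store the physical address in *paddr. */
static xlate_t translate(uint32_t vaddr, uint32_t *paddr)
{
    assert(paddr != NULL);
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    tlb_entry_t *e = &tlb[vpn % TLB_SIZE];

    if (e->valid && e->vpn == vpn) {        /* recent translation cached */
        *paddr = (e->frame << PAGE_SHIFT) | offset;
        return XLATE_TLB_HIT;
    }
    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return XLATE_PAGE_FAULT;            /* the OS has to step in */

    e->valid = true;                        /* walk succeeded: cache it */
    e->vpn = vpn;
    e->frame = page_table[vpn].frame;
    *paddr = (e->frame << PAGE_SHIFT) | offset;
    return XLATE_WALK;
}
```

The physical address produced here is only where the load goes; what ends up in the register is the data stored at that physical address.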

How are base registers, limit registers and relocation registers used?

My understanding of the address translation process in the MMU (memory management unit):
-> logical address : generated by the CPU. This is the address the programmer is concerned with.
-> virtual address : reside in the hard disk , as a pages.
-> physical address : resides in RAM. It is the actual address.
1: The CPU generates the logical address and sends it to the MMU.
2: The MMU translates the logical address into the virtual address, then translates that into the physical address and sends the physical address to RAM.
3: Whenever the RAM is full, a page that is not used often is returned to the hard disk, to free memory for other pages (processes).
My questions are:
1) Where is the value of the relocation register added?
2) Who decides the value of the relocation register?
3) What are the base register and limit register for, and how are they used?
4) Where does the logical address go?
If anybody can answer, I would be grateful.
Please let me know if I have misunderstood anything in this topic.
-thanks
I can tell you how this works on x86.
All programs in non-64-bit modes operate with addresses combined of two items: segment selector (for brevity "selector" is often omitted in text and that may be confusing) and offset. This selector:offset pair is called the logical address.
The selector portion isn't always explicitly specified or manipulated in code, since the CPU has "default" associations of segment registers containing selectors with specific instructions or specific instruction encodings. It's also uncommon to manipulate selectors in 32-bit mode, but it is very often necessary in 16-bit code.
The virtual address is formed from the logical address either "directly" (in real or 8086 virtual mode) or "indirectly" (in protected mode).
"Direct" virtual address = selector * 16 + offset.
"Indirect" virtual address = SegmentDescriptorTable[selector].Base + offset.
SegmentDescriptorTable is either the Global Descriptor Table (AKA GDT) or the Local Descriptor Table (AKA LDT). It's set up by the OS and describes the location and size of various segments of memory. selector is used to select a segment in the table. The Base entry of the table tells the segment's beginning (virtual address). The Limit entry tells the segment size (generally; the details are a little more complex).
When a program tries to access memory with an offset that reaches beyond the end of the segment (the CPU compares the offset with Limit), the CPU generates an exception and the OS handles it, usually by terminating the program.
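The two formulas and the limit check can be written out as a short C sketch. It is deliberately simplified: `segment_desc_t` is a hypothetical two-field descriptor (real GDT/LDT entries pack more fields), the limit is assumed to be the highest valid byte offset, and the table is indexed directly instead of decoding a real selector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor. */
typedef struct {
    uint32_t base;    /* where the segment starts */
    uint32_t limit;   /* highest valid offset (assumed byte granularity) */
} segment_desc_t;

/* Real/v86 mode: the selector value is used directly. */
static uint32_t real_mode_address(uint16_t selector, uint16_t offset)
{
    return ((uint32_t)selector << 4) + offset;
}

/* Protected mode: look up the descriptor, check the limit, add the base.
 * Returns false where the CPU would raise an exception. */
static bool protected_mode_address(const segment_desc_t *table, unsigned index,
                                   uint32_t offset, uint32_t *out)
{
    assert(table != NULL && out != NULL);
    const segment_desc_t *d = &table[index];
    if (offset > d->limit)
        return false;
    *out = d->base + offset;
    return true;
}
```

For instance, `real_mode_address(0x1234, 0x0010)` yields 0x12350. Changing only `base` in a descriptor moves the whole segment without touching any offset in the program, which is the relocation use mentioned in this answer.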
Btw, in real/v86 mode, even though the virtual address is formed directly from selector:offset, there's still a 16-bit Limit imposed on offsets, which is why you need to use a different selector to access more than 64KB of memory.
The Base entry in a segment descriptor can be used to either isolate the segment from the rest of the memory (Limit helps here) or to place or move the entire segment to an arbitrary virtual address without having to modify anything (or much) in the program it belongs to (if we're moving a segment, the data has to be moved in the memory, obviously). Basically, it can be used for relocation purposes. In real/v86 mode for relocation purposes the selector is changed.
The virtual address can be further translated to the physical address if the CPU is running in protected mode and has set up page tables. If there're no page tables, the physical address is the same as the virtual address. The translation is done in blocks of physical memory and address ranges that are called pages (often 4KB).
There's no dedicated relocation register on x86 CPUs. Relocation can be achieved by adjusting:
segment selectors in CPU registers or program's code
segment base addresses in GDT/LDT
offsets in program's code
physical addresses in page tables
As for virtual address : reside in the hard disk , as a pages, I'm not sure what exactly you want to say with this, but just because there's virtual to physical address translation, it doesn't mean there's also virtual on-disk memory. There are other uses for the translation besides virtual on-disk memory. And the addresses reside in the CPU and wherever your (and OS's) code writes them to, not necessarily on the disk.
Your description has a number of mistakes, many of which may be the result of imprecise documentation and common usage.
First of all, there really is no such thing as a virtual address. There are physical and logical addresses. Sadly, the term virtual address is frequently (even in hardware documentation) used when logical address is what is meant.
The CPU instruction stream always operates on logical addresses (values may refer to physical addresses).
When the CPU needs to access a logical address, the MMU attempts to translate it to a physical address. It does that by looking up the address in a page table.
Several things can happen at that point:
There may not be a page table entry for the address => Access violation.
The page table entry is marked invalid => Access violation.
The page table entry indicates that no physical memory is mapped to it => Page fault.
(I omit mode access checks).
It is this last step where virtual memory comes into play. At that point the page fault handler of the operating system needs to find where the corresponding page has been stored on disk, load it, update the page table, and restart the instruction.
The operating system manages the available physical memory by paging writeable memory (that has changed) to disk (read only data does not have to be written back) when there is high demand for physical memory.
I have never heard of a "relocation register" before. But doing a GOOGLE search I can see that some academic material uses it as a confusing pedagogical concept (i.e., with no relation to reality).
Some systems define the page table using base and limit registers. The base register indicates where the page table starts in memory (this can be either a physical or a logical address) and the limit register indicates the size of the table.
The registers are usually not loaded directly. Their values are usually written to the hardware Process Context Block (PCB). When the process context is loaded, the page table base and limit are loaded automatically.
On some systems there are multiple page tables. If there are system and user page tables, the user page tables can refer to logical addresses in the system space and the system page tables refer to physical addresses.
