How to page/swap out a given region of user memory - linux-kernel

I have to create test cases where user code accesses paged out (swapped out) memory in different scenarios.
I am looking for a Linux kernel API that could do that for me. I suppose that:
- its input arguments should be a virtual address and a reference to the virtual address space (MMU),
- it will mark the page inaccessible by setting some bits in the page table entry,
- the get_user_pages API (which does the opposite) will know how to re-enable access to the memory.
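For building such test cases you may not even need a kernel-side API: from user space, madvise(2) with MADV_PAGEOUT (available since Linux 5.4) asks the kernel to reclaim a range, and mincore(2) reports which pages are resident. A minimal sketch under those assumptions (anonymous memory needs swap configured for the pageout to actually succeed):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21   /* value from <linux/mman.h>; needs kernel >= 5.4 */
#endif

int main(void)
{
    size_t len = 16 * 4096;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0xAB, len);                  /* fault all pages in */

    if (madvise(buf, len, MADV_PAGEOUT))     /* ask the kernel to reclaim them */
        perror("madvise(MADV_PAGEOUT)");

    unsigned char vec[16];                   /* mincore: one byte per page */
    if (mincore(buf, len, vec) == 0)
        for (int i = 0; i < 16; i++)
            printf("page %2d: %s\n", i,
                   (vec[i] & 1) ? "resident" : "paged out");

    buf[0] = 1;                              /* touching faults the page back in */
    munmap(buf, len);
    return 0;
}

On the kernel side, the page-table bits you describe are exactly what such reclaim manipulates, and get_user_pages() will indeed fault the pages back in before pinning them.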


memory used by operating system

I think I'm missing a fundamental concept of how the OS manages memory.
OS is responsible for keeping track of what parts of physical memory are free.
OS creates and manages page tables, which have mappings between virtual to physical addresses.
For each instruction that specifies a virtual address, the hardware reads the page table to get the corresponding physical address. One way the hardware may know the location of the current page table is by a register that the OS updates.
This makes sense for how processes access memory. However, I don't understand how the OS itself accesses memory.
Assuming it uses the same instructions, would the hardware still be translating from virtual addresses to physical? Is there, for example, a known physical location for a page table for the OS itself? I know the question is murky; I'm having trouble even understanding what to ask.
At some point there has to be a page table in a physical location. The method used for this depends upon the processor.
Let me give a simple example based upon the VAX processor. Suppose you divide the logical address space into a system range shared by all processes and a user range that is unique to each process. Then give each of those ranges its own page table.
Now you can place the user page table in the system address range of the logical address space.
If you access memory in the user space, you go to a page table that the system finds at a logical address in the system space; the processor then has to translate that logical address into a physical address using the page table for the system space: a two-level translation.
If you used logical addresses for the system-space page table as well, you'd have no way of translating those into physical addresses. Instead, the location of the system page table is defined using physical addresses.
Another approach would be to define all page tables using physical addresses.
I don't understand how the OS itself accesses memory.
Think of the operating system as a process. The OS basically is a process, just like other processes, with elevated privileges. Whenever the OS wants to use some memory location, it uses page tables for virtual-to-physical address translation, just like other processes would.
Think of it this way: Every process has a page table of its own, and the same goes for the OS. The OS remembers the location of these page tables in the control structures associated with each process (e.g. the PCB), and for the currently running process the address (a physical pointer) of the page table is kept in hardware; for the x86 architecture this is control register 3 (CR3). On x86, whenever the OS switches the running process, it changes the value in CR3 so that it points to the correct page table (its own if it switches to itself).
However, this is greatly simplified in modern operating systems, where the kernel (the OS) is mapped into the memory space of all processes, so that the kernel can run whenever it wants without having to switch page tables (which is costly). The kernel's pages in a process's page table are protected from the process itself, and become accessible only once the kernel takes control to do some management task.
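As a minimal sketch of that switch (hypothetical 32-bit x86 kernel code, not taken from any real OS), the scheduler only has to reload CR3 with the next process's page directory base:

#include <stdint.h>

typedef struct process {
    uint32_t page_dir_phys;   /* physical address of this process's page directory */
    /* ... saved registers, scheduling state ... */
} process_t;

/* Must run in ring 0: MOV to CR3 is a privileged instruction.
 * Reloading CR3 also flushes non-global TLB entries. */
static inline void load_cr3(uint32_t phys)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(phys) : "memory");
}

void switch_address_space(const process_t *next)
{
    load_cr3(next->page_dir_phys);
}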

How does the OS catch illegal memory references in a paging scheme?

I am trying to understand how the OS catches all illegal memory access in a system which uses Paging. (32 bits, x86, Paging enabled).
To be more specific, let's suppose I have a tiny App which is just one page in size. Considering that an MS OS takes the upper half of the 'virtual memory address space' and that my tiny EXE occupies just 4 KB of the lower half of the VMAS, then:
1) How does the OS realize that an 'illegal memory reference/access' is going on when my code tries to write to a memory location outside my own EXE's 4 KB? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call.)
2) How are page tables managed for that tiny EXE? Does the OS have to define all 1 M page entries (minus 1 page entry) with the 'Non-Present' attribute set and 'System' owned when that 'process' is created?
Any advice or comment is welcome.
EDIT:
Just to make things clear, the answer (compiled from all the generous contributions) is:
In order to catch an illegal reference to unallocated memory, the VMAS belonging to the App should be marked as User & Non-Present and the rest of the VMAS should be marked as Kernel & Non-Present.
(Of course, allocated memory carries the User attribute. Note that User & Non-Present holds at 'process creation', before the first run; after that it changes to User & Present.)
That way the hardware will catch any access outside the App's boundary!
And the page fault handler will treat the access as illegal, because no user code is allowed to access (read/write) a kernel page.
[VMAS = Virtual Memory Address Space]
1) How does the OS realize that an 'illegal memory reference/access' is going on when my code tries to write to a memory location outside my own EXE's 4 KB? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call.)
A sequence of events has to take place. To determine whether an access is valid, the processor takes as inputs (a) the logical page being accessed, (b) the type of access, and (c) the processor mode.
- Is there a page table entry for the page? If not => access violation.
- Is the page table entry marked valid? The processing here is system specific, depending upon whether the page tables can distinguish between an invalid page table entry and a valid entry that is not mapped to a page frame. In the former case => access violation. In the latter case, a page fault is triggered and the OS has to determine whether to signal an access violation or load the page.
- Does the page table permit the type of access for the current processor mode? If not => access violation.
If the hardware triggers an access violation exception, it switches to kernel mode and invokes the OS's access violation handler.
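A minimal sketch of that decision sequence in C; the structure layout and names here are illustrative assumptions, not any particular OS's or CPU's format:

#include <stddef.h>

typedef struct {
    unsigned valid    : 1;   /* entry maps a page frame in RAM */
    unsigned swapped  : 1;   /* OS-defined: valid mapping, but paged out */
    unsigned user     : 1;   /* accessible from user mode */
    unsigned writable : 1;   /* writes permitted */
} pte_t;

typedef enum { ACCESS_OK, ACCESS_VIOLATION, PAGE_FAULT_LOAD } access_result_t;

/* Mirrors the checks above: existence, validity, then permissions. */
access_result_t classify_access(const pte_t *pte, int is_write, int user_mode)
{
    if (pte == NULL)                    /* no page table entry at all */
        return ACCESS_VIOLATION;
    if (!pte->valid) {
        if (pte->swapped)               /* valid mapping, not in a page frame */
            return PAGE_FAULT_LOAD;     /* OS should load the page */
        return ACCESS_VIOLATION;        /* truly invalid entry */
    }
    if (user_mode && !pte->user)        /* kernel-only page */
        return ACCESS_VIOLATION;
    if (is_write && !pte->writable)     /* read-only page */
        return ACCESS_VIOLATION;
    return ACCESS_OK;
}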
2) How are page tables managed for that tiny EXE? Does the OS have to define all 1 M page entries (minus 1 page entry) with the 'Non-Present' attribute set and 'System' owned when that 'process' is created?
Operating systems provide system services for mapping memory into the process address space. Generally, the program loader reads the instructions in the EXE file and calls page mapping system services to set up the initial state of the application.
When this occurs depends upon the operating system. In Unix-land, a process is a clone of its parent; running a new program takes place via an exec-family system call. Some operating systems have a background command processor that allows multiple applications to be run sequentially within a single process.
From there, it is up to the application to manage the pages mapped to its address space, which is done by calling system services. For example, "malloc" calls will cause the application to use system services to map pages.
The initial state of the application is likely to have holes of invalid user addresses. In fact, the range of valid addresses is not likely to be contiguous within the logical address space.
Each page has, among others, the following attributes: Present and Read/Write.
Accessing a page that is not present, or writing a read-only page, generates a privileged event called a page fault. This event takes the form of the CPU executing a specific routine that the OS set up.
Hence the OS is informed of the event and the attempt that was made.
The structures used to implement paging are hierarchical: pages are grouped into directories, and directories into higher-level directories. There are usually four levels.
Like in a file system, only the directories needed to reach the specific page need to be created.
A definitive source of information is the Intel manuals, specifically the third volume.
This answer intentionally uses simplified words.
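To make "only the directories needed are created" concrete, here is an illustrative software walk of a four-level x86-64-style hierarchy; it assumes identity-mapped table memory and 4 KiB leaf pages (no large-page handling), which is a simplification:

#include <stdint.h>

#define PRESENT 0x1ULL
/* 9-bit index for the given level (0 = page table, 3 = top level). */
#define IDX(va, lvl) (((va) >> (12 + 9 * (lvl))) & 0x1FF)

/* Walks the four levels; returns the physical address for va, or -1 if
 * any level is missing (a "hole", like a missing directory in a file
 * system). */
uint64_t walk(uint64_t *top_level, uint64_t va)
{
    uint64_t *table = top_level;
    for (int lvl = 3; lvl >= 0; lvl--) {
        uint64_t entry = table[IDX(va, lvl)];
        if (!(entry & PRESENT))
            return (uint64_t)-1;                 /* nothing mapped here */
        /* Identity-mapping assumption: physical address used as pointer. */
        table = (uint64_t *)(entry & ~0xFFFULL);
    }
    return ((uint64_t)table) | (va & 0xFFFULL);  /* frame base + page offset */
}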
How does the OS realize that an 'illegal memory reference/access' is going on when my code tries to write to a memory location outside my own EXE's 4 KB? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call.)
A page fault is raised and the page fault handler gets executed. In the case of an invalid memory access, it terminates the program. In the case of an access to swapped-out memory, it restores the contents from disk into main memory and lets the program continue.
How are page tables managed for that tiny EXE? Does the OS have to define all 1 M page entries (minus 1 page entry) with the 'Non-Present' attribute set and 'System' owned when that 'process' is created?
On 32-bit x86 (without PAE) there is a two-level page structure: page directories and page tables. Assuming your program fits in a single page, the OS will initialise a page directory that contains only one valid entry, pointing to a page table, and that page table will in turn contain only one valid entry, pointing to the page containing the needed memory.
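A sketch of that minimal layout, assuming an identity-mapped 32-bit setup (the table and frame addresses would really be physical addresses chosen by the OS's frame allocator):

#include <stdint.h>
#include <string.h>

#define PRESENT 0x1u
#define RW      0x2u
#define USER    0x4u

uint32_t page_dir[1024]   __attribute__((aligned(4096)));
uint32_t page_table[1024] __attribute__((aligned(4096)));

void map_tiny_exe(uint32_t code_frame_phys, uint32_t exe_vaddr)
{
    memset(page_dir, 0, sizeof page_dir);     /* 1023 not-present PDEs */
    memset(page_table, 0, sizeof page_table); /* 1023 not-present PTEs */

    /* One PDE points at the one page table... */
    page_dir[exe_vaddr >> 22] = (uint32_t)page_table | PRESENT | RW | USER;
    /* ...and one PTE points at the single 4 KiB code page. */
    page_table[(exe_vaddr >> 12) & 0x3FF] = code_frame_phys | PRESENT | USER;
}

The 1023 zeroed directory entries cover the rest of the address space with not-present holes at 4 MiB granularity, which is why the OS never needs a million individual not-present PTEs.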

Why does the kernel have a separate virtual address for a user page?

I'm confused about this statement:
From http://web.stanford.edu/class/cs140/projects/pintos/pintos_4.html#SEC63:
In Pintos, every user virtual page is aliased to its kernel virtual
page.
I thought the kernel would just be able to use the user virtual address to refer to the user page, with the kernel virtual addresses above it. In the image below, for instance, wouldn't the entire VAS just be from 0 to 4GB, and the user virtual address space would be restricted to addresses below PHYS_BASE, while the kernel could also access the addresses above it?
(from http://web.stanford.edu/class/cs140/cgi-bin/section/10sp-proj3.pdf)
This doesn't seem to be how it works though, as the PintOS documentation continues:
You must manage these aliases somehow. For example, your code
could check and update the accessed and dirty bits for both addresses.
Alternatively, the kernel could avoid the problem by only accessing
user data through the user virtual address.
This implies that the kernel could access the user data through a separate kernel virtual address. I'm not sure why the two addresses would be different.
Thanks for any clarification.
To access a page, it needs to be mapped in your current virtual address space.
So if the kernel wants to access a user page, there are two solutions:
1) Map the page into our current address space, the kernel's address space, and make sure the two page table entries stay consistent (you don't strictly have to keep them consistent, but you really want to).
2) Switch to an address space where that page is already mapped, the user's own address space.
Your kernel seems to be picking option 1, which is a good thing for performance. Switching to another address space and back takes quite a lot of time.
It could pick option 2 instead and switch to the user's address space every time it wants to access a user page; this would possibly make the code simpler by avoiding some bookkeeping, but it would be awfully slow.
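For option 1, the bookkeeping the Pintos documentation mentions (checking and updating the bits for both addresses) can be done with the accessors Pintos provides in userprog/pagedir.h. A minimal sketch, assuming upage and kpage are the user and kernel aliases of the same frame:

#include <stdbool.h>
#include <stdint.h>
#include "userprog/pagedir.h"

/* Consider the frame dirty if either alias was written through, then
 * clear both bits so the next scan (e.g. for eviction) starts fresh. */
bool frame_check_and_clear_dirty(uint32_t *pd, void *upage, void *kpage)
{
    bool dirty = pagedir_is_dirty(pd, upage) || pagedir_is_dirty(pd, kpage);
    pagedir_set_dirty(pd, upage, false);
    pagedir_set_dirty(pd, kpage, false);
    return dirty;
}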

linux kernel page table update

In Linux x86 paging:
- each process has its own page directory,
- a page table walk starts with the page directory, which is pointed to by CR3,
- every process shares the kernel page directory content.
Assuming these three sentences are correct, let's say some process enters kernel mode and updates its kernel page directory content (address mappings, access rights, etc.).
Question: since the kernel address space is globally shared among processes, this update has to be synchronized with the other processes' page directories, right?
How can this be managed?
I don't know about Linux, so I'll answer for Windows. Some of the kernel space is 'global': a flag is set in the PTE to indicate that the mapping is used by more than one process. The INVPCID instruction can be configured via its register operand to include or exclude these entries in a TLB invalidate. These page table entries are shared between the processes and all appear at the same place in the page table for each process. This way, only the single PTE needs to be updated, and there is no need to synchronise other PTEs across processes, as they all share a single PTE at one physical address.
http://www.cs.miami.edu/home/burt/journal/NT/memory.html
Some kernel memory is not visible to all processes and is private to each process (which doesn't change the fact that it is still ring 0). On a 32-bit Windows system this would be 0xC0000000–0xC0200000, which contains all the user-space PTEs and PDEs, where 0xC0000000 is PTE_BASE. That allows the equation
#define MiGetPteAddress(x) ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + (ULONG_PTR)MmPteBase))
#define MiAddressToPte(x) MiGetPteAddress(x)
to work elegantly for converting the faulting virtual address in CR2 to the address of its PTE. This region is private to each process, as every process uses the same PTE allocation base address; if every process's page tables were visible to all processes, they would quickly take up virtual memory, as each set of page tables would have to be allocated sequentially. It doesn't need to be visible to all processes, because a process has no interest in the page table entries of another process. A page fault is always handled in the context of the current process, and 0xC0000000–0xC0200000 means something different in each process context.
The kernel-space range 0xC0200000–0xC0400000, used for kernel PTEs (for kernel addresses), would however be global and shared by all processes, except for the section within it that maps 0xC0000000–0xC0200000 (by my calculation 0xC0300000–0xC0300800), which is the user-mode side of the PDEs (PDE_BASE spans 0xC0300000–0xC0300FFF).
It is, however, impossible to split up the user PDE and kernel PDE sections such that the former is private and the latter global (i.e. make 0xC0300000–0xC0300800 point to different physical addresses per process while 0xC0300800–0xC0300FFF points to the same physical address for each process), because the whole PDE region (0xC0300000–0xC0300FFF) lies on a single physical frame, the frame pointed to by CR3, and CR3 is different for each process. This means the whole PDE region (all PDEs) has to be private per process (duplicated and installed per process). If a kernel page table page (a page containing a kernel page table) were paged out and back in to a new physical location, the PDEs would all have to be synchronised, because every process has its own copy at a different CR3 physical address rather than sharing one physical PDE. I'm not sure how this could be done efficiently, so it would be wise to impose the restriction that kernel page tables cannot be paged out and keep them in non-paged pool; this way the kernel PDEs remain constant across all CR3 pages. On 64-bit, the corresponding restriction would be that kernel PDPTs can't be paged out.
On 32-bit Windows, a process is started with a physical CR3 page containing a PDE at offset 1100000000 (base 2) * 4 bytes that points to the CR3 page itself, which is hard-written in, probably by briefly turning off paging in CR0 (because the write won't succeed without the recursive entry that needs to be written being there, creating a paradox). Notice that the PD entry for itself acts as the page table covering the range 0xC0000000–0xC0400000, i.e. it points to 1023 page tables and one page directory, itself (2^10 entries), and hence allows the PTEs to be modified through their virtual addresses. The reason the CR3 page shows up at 0xC0300000 is that this address has identical page directory and page table indexes (1100000000 and 1100000000), so the walk loops back on itself twice, yielding the CR3 page itself, and you can modify the PDEs by address (there are other addresses that are special like this, e.g. 0xE0380000). After this is set up, the appropriate kernel mappings are made.
On 64-bit Windows it is similar: a process is set up with a single PML4 table page that points to itself, and this way any PML4E, PDPTE, PDE or PTE can be filled in and accessed thanks to the variable number of loopbacks. On 64-bit Windows, when a process is terminated, all the physical pages of the process are moved to the free list, which includes all user physical PDPT pages, PD pages, PT pages and the PML4/CR3 page; the kernel ones are not marked for the free list.
In general, if you know which entry in the PML4 is the recursive entry pointing to the physical PML4 page, you can work out the virtual address of the PTE structure that services (is used to translate) a particular virtual address range, and of the PTE for a particular virtual address in that range. You prepend the offset of the recursive entry in the PML4 (9 bits on 64-bit; 10 bits for the page directory on 32-bit) to the start of the virtual address whose servicing PTE address you want to find (this is what the addition of 0xC0000000 does in the 32-bit equation earlier), remove the last 12 bits, and then widen the offset now at the end of the virtual address back to 12 bits by multiplying it by 8 (or 4), hence the right shift by 12 and the left shift by 3 (or by 2 for 32-bit entries).
One loopback takes away one layer of indirection and you get the virtual address of the PTE. Two loopbacks leave you with the virtual address of the PDE used to translate that particular virtual address, and so on. PTE_BASE on 32-bit Windows is the offset 1100000000 left-shifted to make 32 bits, and PDE_BASE is the offset 11000000001100000000 left-shifted to make 32 bits. PTE_BASE is used in the macro, and any virtual address with one of these prefixes is by definition part of a PTE or a PDE respectively. Windows chooses the offset 1100000000 for the page table hierarchy, but it could be any of the 2^10 combinations.
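A sketch of that loopback arithmetic for 64-bit paging; RECURSIVE_SLOT is a hypothetical example value (whichever of the 512 PML4 entries the OS pointed back at the PML4 itself). With a 10-bit recursive slot of 1100000000 and 4-byte entries, the same arithmetic collapses to the MiGetPteAddress macro quoted earlier:

#include <stdint.h>

#define RECURSIVE_SLOT 0x1EDULL   /* example slot; the OS picks one of 512 */

/* Sign-extend from bit 47 to form a canonical 64-bit address. */
static uint64_t canonical(uint64_t v)
{
    return (v & (1ULL << 47)) ? (v | 0xFFFF000000000000ULL) : v;
}

/* One loopback removes one layer of indirection: the virtual address of
 * the PTE that translates va. The old indexes all shift down one level
 * (>> 9), scaled to 8 bytes per entry. */
uint64_t pte_vaddr(uint64_t va)
{
    return canonical((RECURSIVE_SLOT << 39) | ((va >> 9) & 0x7FFFFFFFF8ULL));
}

/* Two loopbacks yield the PDE used to translate va, and so on. */
uint64_t pde_vaddr(uint64_t va)
{
    return canonical((RECURSIVE_SLOT << 39) | (RECURSIVE_SLOT << 30)
                     | ((va >> 18) & 0x3FFFFFF8ULL));
}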
KAISER, or KPTI, designed to mitigate Meltdown, most likely keeps two CR3s for each process. Upon trapping to the kernel, the restricted CR3 used for user mode, which would contain a single kernel PML4E (just enough for a preliminary interrupt dispatch routine to be accessible and perform the swap), is replaced with the full CR3 containing all kernel PML4Es.
As for physical memory on windows, see here: https://superuser.com/a/1549970/933117
Question: since the kernel address space is globally shared among processes, this update has to be synchronized with the other processes' page directories, right?
How can this be managed?
First, understand that paging usually involves 2 or more levels of tables. For example (for 80x86), for the oldest "plain 32-bit paging" there are page tables and page directories; and for current long mode there are the page map level 4, page directory pointer table, page directory and page table. CR3 points to the highest-level table, and that must be different for each virtual address space ("process"). A single second-highest-level table, however, can be put into all of the highest-level tables; if you do that, any change to that second-highest-level table automatically changes every virtual address space.
This means that (for 80x86), for the oldest "plain 32-bit paging" you can put the same "kernel page table" into all virtual address spaces (all page directories) and when you add/remove pages from that page table it will automatically affect all virtual address spaces; and for current long mode you can put the same page directory pointer table in all virtual address spaces (all page map level 4 tables) and when you add/remove page directories, page tables, or pages, it will automatically affect all virtual address spaces.
This means that you only really need some way to change second highest level page tables (or, some way to change all highest level page table entries). There are multiple ways to do this. The easiest is pre-allocation. For example, if you say "kernel space will always be N MiB" you can pre-allocate all the second highest level tables you'd need for "N MiB" during boot and never change them (e.g. for long mode, you could say that kernel space will be 512 GiB, pre-allocate a single "kernel page directory pointer table", and put that into every page map level 4 when a virtual address space is being created, and then rely on all other changes (to page directories, page tables, etc) automatically affecting the kernel space for all virtual address spaces). I believe this is the method Linux uses (partly because Linux uses the silly "map all RAM into kernel space" security disaster at boot).
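A sketch of the pre-allocation idea in illustrative x86-64-style code; this variant shares the whole kernel half of the address space (256 top-level entries) rather than the single-PDPT 512 GiB variant described above:

#include <stdint.h>
#include <string.h>

#define ENTRIES 512
static uint64_t kernel_pml4e[ENTRIES / 2];   /* filled once during boot */

void create_address_space(uint64_t *new_pml4)
{
    /* User half starts empty: all not-present. */
    memset(new_pml4, 0, (ENTRIES / 2) * sizeof(uint64_t));
    /* Kernel half: copy the boot-time entries, so every address space
     * points at the same pre-allocated lower-level tables. Changes to
     * those tables are then seen by every process automatically. */
    memcpy(new_pml4 + ENTRIES / 2, kernel_pml4e,
           (ENTRIES / 2) * sizeof(uint64_t));
}

The design point is that the shared tables themselves never move after boot, so the copied top-level entries never need to be re-synchronised.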
However, this is just the table changes alone. There are 2 other concerns.
The first "other concern" is the CPU's translation look-aside buffers (TLBs); which need to be flushed when (virtual address to physical address) translation/s change. Most operating systems use a combination of "lazy TLB shootdown" (where a CPU using wrong information from TLB causes a page fault and page fault handler invalidates and returns so the software that caused the page fault can continue with the new/correct translation without knowing anything happened) and "multi-CPU TLB shootdown" (where you send an inter-processor interrupt to other CPUs and that interrupt handler invalidates the TLB entries).
The second "other concern" is making sure CPUs don't try to change the same thing at the same time. This typically ends up being a problem solved at a higher level. For example, if you acquire a lock for a certain data structure (before changing something in that data structure) and realize you need to allocate/free pages for that data structure (while you're trying to make the changes); then the code that modifies paging tables doesn't need to care about different CPUs changing the page tables at the same time because it knows that something at a higher level (the data structure's lock) already ensures that can't happen.
When the kernel changes page table entries, these updates must be made atomically:
In the 64-bit kernel this can conveniently be done using 64-bit memory operations, while i386 needs to use CMPXCHG8B.
(Source)
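To illustrate why the quoted requirement exists: a 64-bit PTE written as two 32-bit stores could be observed half-written by the hardware page walker. A sketch using C11 atomics; the kernel uses its own primitives (e.g. CMPXCHG8B on i386) rather than <stdatomic.h>:

#include <stdatomic.h>
#include <stdint.h>

/* Publish a new 64-bit PTE in one shot so the hardware walker can never
 * observe a torn, half-updated entry. */
void set_pte(_Atomic uint64_t *pte, uint64_t new_val)
{
    atomic_store_explicit(pte, new_val, memory_order_release);
}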

Kernel mode transition

If I understand correctly, a memory address in system space is accessible only from kernel mode. Does that mean that when components mapped in system space are executed, the processor must be switched to kernel mode?
For example: the virtual memory manager is a frequently used component and is mapped in system space. Whenever the VMM runs in the context of a user process (let's say it translates an address), must the processor be switched to kernel mode?
Thanks,
Suresh.
Typically, there are two parts involved: the MMU (memory management unit), a hardware component that does the translation from virtual addresses to physical addresses, and the operating system's VM subsystem.
The operating system part needs to run in privileged mode (a.k.a. kernel mode) and will set up/change the mappings in the MMU based on the user space's needs.
E.g. to request more (virtual) memory, or to map a file into memory, a transition to kernel mode is needed, and the VM subsystem can then change the mappings of the process.
Around this there are often a ton of tricks, e.g. mapping the whole kernel address space into each user process's virtual space but restricting its access so the process can't use that memory; this means that whenever you transition to kernel mode you don't need to reload the mapping for the kernel.
Taking your example of the virtual memory manager, it never actually runs in user space. To allocate memory, user-mode applications make calls through the Win32 API (NTDLL.DLL, as one example) to routines such as VirtualAlloc.
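As a user-mode illustration of that path, VirtualAlloc is the documented Win32 entry point; both the commit and the later first-touch page fault are serviced by the VMM in system space:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reserve and commit 64 KiB; this call transitions to kernel mode. */
    void *p = VirtualAlloc(NULL, 64 * 1024,
                           MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }
    ((char *)p)[0] = 42;          /* first touch faults the page in */
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}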
With regards to address translation, here's a summary of how it works (based on the content from Windows Internals 5th Edition).
The VMM maintains page tables, which the CPU uses to translate virtual addresses to physical addresses. The page tables live in system space. Each table contains many PTEs (page table entries), which store the physical address to which a virtual address is mapped. I won't go into too much detail here, but the point is that all of the VMM's work is performed in system space and not in user space.
As for context switching: when a thread running in user space needs to run in system space, a switch will occur. Since the memory manager lives in system space, its threads never need to make that switch; they already live in system space.
Apologies for the simplistic explanation, this is quite a complicated topic of discussion in depth. I would highly recommend that you pick up a copy of Windows Internals as this sounds like it would come in handy for you.
