Interaction of driver-supplied vm_operations_struct.fault method with page cache - linux-kernel

https://manybutfinite.com/post/page-cache-the-affair-between-memory-and-files/ says "all regular file I/O happens through the page cache" (including for mmapped files). Great!
However, device drivers can choose the physical page returned by a page fault on an mmapped VMA by installing a .fault method in the VMA's vm_operations_struct.
Let's say a file backed by such a driver is mmapped by a userspace app, and an access to the mapped VA page faults. What physical page ends up in the page cache representing that section of the file? Does the kernel accept the physical page returned by the driver's .fault method as the "kernel-official" page cache entry representing that section of the file? Or does the kernel take the page returned by the driver's .fault, copy it to some other kernel-managed page, and set that other page as the "official" page cache entry? Or does something else happen?
Thanks in advance for your help!

Related

What is in the PTE address field for an anonymously zero-fill-on-demand mapped page?

When a program calls mmap to allocate an anonymous page, also known as a demand-zero page, what appears in the address field of the corresponding page table entry (PTE)? I am assuming that the kernel does not create a zero-initialized page in physical memory (and enter that physical page's page number into the PTE) until the requesting process actually touches the page — hence the term demand-zero. Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there? As a different but related question, how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?
I am assuming that the kernel does not create a zero-initialized page in physical memory
Indeed, this is usually the case, except in special cases, for example if MAP_POPULATE is specified to explicitly request that the pages be populated up front (also called "pre-faulting").
what appears in the address field of the corresponding page table entry (PTE)?
Right after mmap you don't even have a PTE allocated for the page (or in general, you don't have any entry at any page table level). As far as the CPU is concerned, the page doesn't even exist. If you were to walk the page table you would just get to a point (at an arbitrary level) where the corresponding entry is marked as "not present".
Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there?
As far as the CPU is concerned, the page is unallocated. At the first page fault, two things can happen:
For a read page fault, the PTE is updated to point to the zero page: this is a special page that is always entirely zeroed-out and is pointed to by the PTEs of any anonymous (demand-zero) page in the system that has not been modified yet.
For a write page fault, an actual physical page will be allocated and the corresponding PTE updated to point to its physical address.
Quoting directly from the documentation:
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
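The read-before-write behaviour can be observed from user space. Here is a minimal Python sketch, assuming a platform where `mmap.mmap(-1, n)` creates an anonymous mapping (Linux does). The helper name `probe_anonymous_page` is made up for illustration. It cannot show which physical page backs the mapping, only that the contents read as zero before any write:

```python
import mmap

def probe_anonymous_page():
    """Map one anonymous page, read it before writing, then write to it."""
    size = mmap.PAGESIZE
    with mmap.mmap(-1, size) as m:    # fileno -1 requests an anonymous mapping
        before = bytes(m[:size])      # the first access is a read: all zeros
        m[0] = 0x2A                   # the first write needs a real, private page
        after = bytes(m[:size])
    return before, after

before, after = probe_anonymous_page()
print(before == bytes(mmap.PAGESIZE))   # True
print(after[0], after[1])               # 42 0
```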
how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?
When a page fault occurs, the kernel page fault handler (architecture-dependent) determines which VMA the page belongs to, and retrieves the corresponding struct vm_area_struct (which was created earlier either by the kernel itself or by a mmap syscall). This structure is then passed on to architecture-independent code (handle_mm_fault(), which eventually reaches do_anonymous_page() or do_fault()) along with the needed fault information (collected into a struct vm_fault).
The vm_area_struct then contains all the remaining necessary information to handle the fault (for example the ->vm_file field, which is != NULL in case of a file-backed mapping). The field ->vm_ops points to a struct vm_operations_struct which defines a set of function pointers to call on different occasions. In particular, anonymous VMAs have ->vm_ops == NULL.
For other kinds of pages, ->fault() is the function used when handling a page fault. This function knows what to check and how to actually handle the fault.
B & O also describe the VMA, but do not explain how the kernel could use the VMA to distinguish between, say, an unallocated page and an allocated page to be created and zero-initialized.
Simple: just check whether vma->vm_ops == NULL, and if so you know that the page is a demand-zero anon page. Then on a page fault act as needed (read fault -> update PTE to point to global zero page, write fault -> allocate a page and update PTE).
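To make the decision concrete, here is a toy Python model of that dispatch. It is only a sketch and not kernel code: `Vma`, `handle_fault`, `ZERO_PAGE` and the returned labels are hypothetical names, and the `ptes` dictionary stands in for real page tables.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)   # stands in for the kernel's shared zero page

@dataclass
class Vma:
    """Toy stand-in for struct vm_area_struct (only the fields used here)."""
    start: int
    end: int
    vm_ops: Optional[Callable[[int], bytearray]] = None   # None means anonymous
    ptes: dict = field(default_factory=dict)              # page number -> backing page

def handle_fault(vma, addr, write):
    """Resolve a fault at addr and return a label for the path taken."""
    if not (vma.start <= addr < vma.end):
        return "SIGSEGV"                          # the VMA does not cover the address
    page = addr // PAGE_SIZE
    if vma.vm_ops is None:                        # anonymous, demand-zero mapping
        if write:
            vma.ptes[page] = bytearray(PAGE_SIZE) # fresh private zero-filled page
            return "anon-write"
        vma.ptes[page] = ZERO_PAGE                # reference the shared zero page
        return "anon-read"
    vma.ptes[page] = vma.vm_ops(page)             # the file or driver supplies the page
    return "fault-method"

anon = Vma(0x1000, 0x3000)
print(handle_fault(anon, 0x1000, write=False))   # anon-read
print(handle_fault(anon, 0x2000, write=True))    # anon-write
print(handle_fault(anon, 0x5000, write=False))   # SIGSEGV
```

The model skips the later write-protect fault that happens when a page first mapped to the zero page is written to.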

How OS catches illegal memory references at paging scheme?

I am trying to understand how the OS catches all illegal memory access in a system which uses Paging. (32 bits, x86, Paging enabled).
To be more specific, let's suppose I have a tiny App which is just 1 Page in size. Considering that an MS OS takes the upper half of the 'virtual memory address space' and that my tiny EXE occupies just 4k of the lower half of the VMAS, then:
1) How OS realizes that there is an 'illegal memory reference/access' going on when my code tries to write to a memory location outside from my own Exe's 4k? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call).
2) How are Page Tables managed for that tiny Exe? Does OS have to define all 1 M Page Entries (-1 Page Entry) with a 'Non-Present' attribute set and 'System' owned? (When that 'process' is created).
Any advice or comment is welcome.
EDIT:
Just to make things clear, the answer (compiled from all generous contributions) is:
In order to catch an illegal reference for unallocated memory, the VMAS for the App should be marked as User & Non-Present and the rest of the VMAS should be marked as Kernel & Non-Present.
(Of course, allocated memory is with User attribute. Take note that User & Non-Present is at 'process creation' before its first run!. After that it changes to User & Present).
That way the hardware monitor will catch any access outside of the App boundary!!!
And the Page Fault handler will assume an illegal access because no User code is allowed to access (read/write) a Kernel page.
[VMAS= Virtual Memory Address Space]
1) How OS realizes that there is an 'illegal memory reference/access' going on when my code tries to write to a memory location outside from my own Exe's 4k? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call).
A sequence of events has to take place. The processor takes as inputs (a) the logical page being accessed; (b) the type of access; and (c) the processor mode to determine whether an access is valid.
Is there a page table entry for the page? If not => access violation
Is the page table entry marked valid?
The processing here is system specific, depending upon whether the page tables can distinguish between an invalid page table entry and a valid entry that is not mapped to a page frame. In the former case => access violation. In the latter case, it triggers a page fault and the OS has to determine whether to trigger an access violation or load the page.
Does the page table permit the type of access for the current processor mode? If not => access violation.
If the hardware triggers an access violation exception, it switches to kernel mode and invokes the OS's access violation handler.
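The sequence of checks above can be sketched in Python. This follows the answer's generic model, in which "access violation" and "page fault" are separate outcomes. On x86 specifically, all of them are reported through the page-fault exception and the OS sorts them out from the error code. `Pte` and `check_access` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pte:
    present: bool    # currently mapped to a page frame
    writable: bool   # writes are permitted
    user: bool       # accessible from user mode

def check_access(pte: Optional[Pte], write: bool, user_mode: bool) -> str:
    """Classify one memory access the way the list above describes."""
    if pte is None:
        return "access violation"     # no page table entry for the page
    if not pte.present:
        return "page fault"           # OS decides: load the page or raise a violation
    if user_mode and not pte.user:
        return "access violation"     # user code touching a kernel page
    if write and not pte.writable:
        return "access violation"     # write to a read-only page
    return "ok"

code_page = Pte(present=True, writable=False, user=True)
print(check_access(code_page, write=False, user_mode=True))   # ok
print(check_access(code_page, write=True, user_mode=True))    # access violation
print(check_access(None, write=False, user_mode=True))        # access violation
```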
2) How are Page Tables managed for that tiny Exe? Does OS have to define all 1 M Page Entries (-1 Page Entry) with a 'Non-Present' attribute set and 'System' owned? (When that 'process' is created).
Operating systems provide system services for mapping memory into the process address space. Generally, the program loader reads the instructions in the EXE file and calls page mapping system services to set up the initial state of the application.
When this occurs depends upon the operating system. In eunuchs-land, a process is a clone of its parent. The running of a program takes place in an exec___ system call. Some operating systems have a background command processor that allows multiple applications to be run sequentially within a single process.
From there, it is up to the application to manage the pages mapped to its address space. That is done by calling system services. For example "malloc" calls will cause the application to use system services to map pages.
The initial state of the application is likely to have holes of invalid user addresses. In fact, the range of valid addresses is not likely to be contiguous within the logical address space.
Each page has, among others, the following attributes: Present and Read/Write.
Accessing a page that is not present, or writing a read-only page, generates a privileged event called a page fault. This event takes the form of the CPU executing a specific routine that the OS set up.
Hence the OS is informed of the event and the attempt that was made.
The structures used to implement paging are hierarchical: pages are grouped into directories, and directory into higher directories. There are usually four levels.
Like in a file system, only the directories needed to reach the specific page need to be created.
A definitive source of information is the Intel manuals, specifically the third volume.
This answer intentionally uses simplified words.
How OS realizes that there is an 'illegal memory reference/access' going on when my code tries to write to a memory location outside from my own Exe's 4k? (Obviously, that pointer wasn't obtained from a 'malloc' or similar call).
A page fault is raised and the page fault handler gets executed. In the case of an invalid memory access it terminates the program. In the case of an access to swapped-out memory, it restores the memory contents from the disk into main memory again and lets the program continue.
How are Page Tables managed for that tiny Exe? Does OS have to define all 1 M Page Entries (-1 Page Entry) with a 'Non-Present' attribute set and 'System' owned? (When that 'process' is created).
On 32-bit x86 (without PAE), there are two-level page structures: page directories and page tables. Assuming your program fits in a single page, the OS will initialise a page directory that contains only one valid entry pointing to a page table, and that page table contains only one valid entry pointing to the page containing the needed memory.
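As a worked example of why the OS does not need all 1 M entries, here is a small Python sketch of the classic 10/10/12 split of a 32-bit virtual address (4 KiB pages, no PAE). The function names and the sample address 0x00401234 are made up for illustration.

```python
PAGE_SHIFT = 12    # 4 KiB pages: the low 12 bits are the offset within the page
INDEX_BITS = 10    # 1024 entries per page directory and per page table
ENTRY_SIZE = 4     # bytes per entry, so each structure fills exactly one 4 KiB page

def split_virtual_address(va):
    """Return (directory index, table index, page offset) for a 32-bit address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    table_index = (va >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    dir_index = va >> (PAGE_SHIFT + INDEX_BITS)
    return dir_index, table_index, offset

def paging_structures_bytes(mapped_addresses):
    """Memory for one page directory plus only the page tables actually needed."""
    tables = {va >> (PAGE_SHIFT + INDEX_BITS) for va in mapped_addresses}
    structure_size = (1 << INDEX_BITS) * ENTRY_SIZE
    return (1 + len(tables)) * structure_size

print(split_virtual_address(0x00401234))      # (1, 1, 564)
print(paging_structures_bytes([0x00401234]))  # 8192
```

So a one-page program needs one directory and one table (8 KiB), and every other directory entry simply stays marked not present.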

Unmapping a "twicely" mapped page in the Linux Kernel

I use kmap to get the first virtual address of a low-memory page, inside a Linux Kernel module.
What happens if I call kunmap after that mapping? Is the persistent page mapping totally deleted or just some mapping counter is decreased? (should be 2 before the unmapping)
kmap does no mapping if the page belongs to low memory, and hence kunmap does nothing either; calling them is harmless because these checks are handled in their implementation.
First about kmap
kmap checks if page is below highmem_start_page (ie lowmemory page) as pages from lowmem are already visible and do not need to be mapped. If the page is already in low memory kmap simply returns the address of it.
Now about kunmap
kunmap checks if page is below highmem_start_page. If it is, the page already exists in low memory and needs no further handling, hence nop.

Why there is no SIGSEGV signal on copy on write?

The copy-on-write article on Wikipedia says that copy-on-write is usually implemented by giving read-only access to the pages, so that when one is written, the page fault trap handler can map a unique physical memory page for it. So my question is: why doesn't a user-level application receive a SIGSEGV signal when such a page fault happens? After all, the Wikipedia article on SIGSEGV says that SIGSEGV is the signal sent to a process when it makes an invalid memory reference, or segmentation fault. So in the copy-on-write case, why is no SIGSEGV sent to the process?
I know it's been a while since this was asked, but I wanted to expand on Alexey's answer a bit.
Copy-on-write (I assume you're talking about virtual memory and not filesystems) usually works like so:
The OS knows which pages need to be copied on write. (They are the pages which are private to a process.) These pages are marked in hardware as read-only. However, the virtual memory map of the process has the pages marked as readable and writable. This means that the user process believes it has full access to the pages in question.
When a user process attempts to write to one of these pages, a page fault is generated because the processor recognizes that the page is read-only (based on the hardware marks before). Page faults are sort of like segfaults, but for the kernel instead of for user processes.
This triggers the page fault handler to run within the kernel, which looks at the page in question and sees that it's a private page which has not yet been copied. The handler will create a copy of the page and mark the copy as writable.
Then the handler will replace the old page's address with the new one in the virtual-to-physical translation table and exit.
The last instruction will be retried by the user process at this point, and this time the write will succeed because the new page is writeable at both the virtual memory map (the user process' view of memory permissions) and hardware (the kernel's view of memory permissions) levels.
A page fault is generated every time a segmentation fault occurs, but most page faults are handled by the kernel and are never passed to the process that caused them as segfaults. There are many reasons why a page fault might be handled at a lower level, including:
The page which was accessed was paged out to disk because it hadn't been used in a long time. The OS must bring it back into memory so the process can use it again.
The process is accessing a newly-allocated page for the first time, and the actual physical page hasn't been allocated yet. The OS must allocate a page and then insert it into the virtual-to-physical translation table before the memory can actually be used.
The OS is playing a hardware page access permissions trick to allow it to watch for accesses to a particular page. This is what happens in copy-on-write, but it can have other uses as well. Consider an OS-level virtualization technology like kvm, where writing to a memory-mapped device's location in memory in the guest OS should actually write to a file or the display in the host OS.
The main idea of COW is that COW is completely transparent to the user process as if it fully owned the memory without any sharing.
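That transparency can be seen from user space. A minimal Python sketch, assuming a platform where `mmap.ACCESS_COPY` gives a private copy-on-write mapping (on Linux it corresponds to MAP_PRIVATE); the function name `cow_demo` and the sample bytes are made up for illustration. The write succeeds without any signal, and the underlying file is left untouched:

```python
import mmap
import tempfile

def cow_demo():
    """Write through a private (copy-on-write) mapping of a temporary file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"original")
        f.flush()
        # ACCESS_COPY: writes change this process's view, never the file itself
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
            m[:8] = b"modified"           # no SIGSEGV: the fault is handled by the kernel
            private_view = bytes(m[:8])
        f.seek(0)
        file_contents = f.read(8)
    return private_view, file_contents

print(cow_demo())   # (b'modified', b'original')
```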

Can a TLB hit lead to page fault in memory?

In the UC Berkeley video lectures on OS by John Kubiatowicz (Prof. Kuby) available on the web, he mentioned that a TLB hit doesn't mean that the corresponding page is in main memory. A page fault can still occur.
Technically, TLBs are caches of page table entries, and not every page table entry has its corresponding page in main memory. The same could be true of TLB entries, so a TLB hit might lead to a page fault.
But from the algorithms given in textbooks, I am unable to find such a case. On a TLB miss, the kernel refers to the page tables and updates the TLB with the appropriate address translation, so the next TLB hit can't lead to a page fault. When the kernel swaps out a page, it updates the appropriate bits in that page table entry and invalidates the corresponding TLB entry, so there can't be a TLB hit until the page is loaded into main memory again.
So can someone vouch for the correctness of Prof. Kuby's claim and point out a case in which a page fault occurs despite a TLB hit (the physical address for the virtual address is found in the TLB)?
One example is if the memory access is different from the allowed one.
e.g. you want to write to memory that's write-protected. A TLB entry exists, it's a hit and the address is translated. But on access you get a trap, as you're trying to write to memory that's read-only.
A page fault doesn't mean a missing page in the memory. A page can still be present and be dirty. This is also a page fault.
On a general note, the page fault refers to the scenario where the obtained translation cannot be effectively used.
It may be a missing page or a dirty page or access permission mismatch.
So a TLB hit can still lead to a page fault.
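The write-protection case can be modelled in a few lines of Python. This is a toy sketch, not how any particular MMU is implemented: `TlbEntry`, `PageFault` and `translate` are hypothetical names, and a plain dictionary stands in for the TLB.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TlbEntry:
    frame: int        # physical frame number cached from the page table entry
    writable: bool    # permission bit cached alongside the translation

class PageFault(Exception):
    """Raised when a translation exists but the access is not permitted."""

def translate(tlb, vpn, write):
    """Return ("miss", None) or ("hit", frame); raise PageFault on a bad access."""
    entry = tlb.get(vpn)
    if entry is None:
        return ("miss", None)             # the page tables would be walked next
    if write and not entry.writable:
        raise PageFault(f"write to read-only page {vpn:#x}")   # a hit, yet a fault
    return ("hit", entry.frame)

tlb = {0x10: TlbEntry(frame=0x99, writable=False)}
print(translate(tlb, 0x10, write=False))   # ('hit', 153)
print(translate(tlb, 0x11, write=False))   # ('miss', None)
try:
    translate(tlb, 0x10, write=True)
except PageFault as exc:
    print("page fault:", exc)              # page fault: write to read-only page 0x10
```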
Patterson says: "cannot have a translation in TLB if page is not present in memory" [Computer Organization and Design, 4th ed. revised, page 507]

Resources