How do modern OS kernels (UNIX, Windows) distinguish between page faults? [closed] - memory-management

I'm trying to understand how page faults are handled by the OS kernel. The Wikipedia article at https://en.wikipedia.org/wiki/Page_fault distinguishes between Minor, Major and Invalid page faults.
A major page fault is one where the virtual-to-physical mapping is not present in main memory because the page's contents currently reside on disk; for this page fault exception, the exception handler reads the page from disk into a page frame in main memory and establishes the virtual-to-physical mapping.
An invalid page fault happens when an application tries to access an unmapped address, for example through a rogue pointer. The same page fault exception is raised, but the exception handler now decides to terminate the program, usually with a "Segmentation fault (core dumped)" error.
My question is, how does the kernel distinguish between these two types of page faults? I'd like the answer to go into a bit of depth about this, and hopefully link me to more elaborate articles if possible. Please ask me for any clarifications!
Thanks.

Roughly speaking, the kernel keeps some representation of the virtual address space of the (current) process. For each region of pages, it knows how page faults in that region should be handled. The kernel works with physical addresses (so its address space is not the user-mode address space), but it maintains complex data structures to efficiently represent the mapping between virtual and physical addresses (if any), and it configures the MMU according to these.
See for example Gorman's book Understanding the Linux Virtual Memory Manager (some details are probably outdated).
Read also about GNU Hurd external pager mechanism.
On a page fault, the kernel is given the relevant faulting (physical and/or virtual) address, e.g. by the MMU hardware. See the Paging page on the OSDev wiki, and read about page tables. The kernel handles all page faults (it gets the same hardware exception for every page fault, along with data describing the fault, including the faulting virtual address) and determines what kind of fault it is.
On Linux you could even handle the SIGSEGV signal yourself (in a non-portable, ABI- and processor-specific manner): the kernel passes your SIGSEGV handler all the information it is able to give, but read signal(7) carefully. It is usually not worth the pain.
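For illustration, here is a minimal Linux sketch (assuming glibc; the bad address is chosen deliberately) that installs a SIGSEGV handler with SA_SIGINFO and prints the faulting address the kernel gathered:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* The kernel fills in siginfo_t: si_addr is the faulting virtual
       address it extracted from the hardware exception. */
    static void on_segv(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* fprintf is not async-signal-safe; acceptable in a demo that exits. */
        fprintf(stderr, "SIGSEGV at %p\n", info->si_addr);
        _exit(EXIT_FAILURE);
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;           /* ask for the siginfo_t argument */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        volatile int *rogue = (int *)0x42;  /* deliberately unmapped address */
        *rogue = 1;                         /* invalid page fault -> SIGSEGV */
        return 0;
    }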
Look also inside the mm/ subtree of the Linux kernel source.
Read also the extensive documentation of Intel processors. Perhaps read some books on processor architecture and on operating systems, and study simpler architectures (like MMIX or RISC-V).
See Operating Systems: Three Easy Pieces, notably its introduction to paging.

I would ignore the model in the Wikipedia article. An invalid page fault is not a page fault at all, but rather a failure of logical memory translation.
The concept of a major and a minor page fault is, IMHO, confusing. In fact, the Wikipedia article describes two different things as being a minor page fault. I even wonder whether something different was intended than what the text says.
I would rethink as this:
A process accesses a memory address.
The memory management unit attempts to translate the referenced LOGICAL PAGE to a PHYSICAL PAGE FRAME using the page tables.
If no such translation is possible (there is no corresponding page table entry, or the page table entry is marked as invalid), an access violation fault exception of some kind is generated (the "Invalid Page Fault" of the Wiki article).
If there is already a direct mapping between the logical page and the physical page frame, we're all done.
If the page table indicates there is no physical page frame corresponding to the logical page at the moment, the CPU triggers a page fault exception.
The page fault handler executes.
The page fault handler has to find where the logical (now a virtual) page is stored.
During this process, the page fault handler may find that the page is sitting in physical memory already. There are a number of ways in which this can occur. If that is the case, all the page fault handler has to do is update the page table to reference the physical page frame and restart the instruction (this is one of the circumstances the wiki article calls a "minor page fault"). All done.
The other alternative is that the virtual page is stored on disk in a page file, executable file, or shared file. In that case, the handler needs to allocate a physical page frame, read the virtual page from disk into the page frame, update the page table, then restart the instruction (what the wiki calls a "major page fault"). Because of the disk read, the "major" fault takes much longer to handle than the "minor" fault.
One of the functions of the operating system is to keep track of where all the virtual pages are stored. The specific mechanism used to find a page will depend upon a number of factors.
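To make that dispatch concrete, here is a hedged pseudocode sketch in C; every type and helper name below is invented for illustration and is not any real kernel's API:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative pseudocode only: all names are invented. */
    struct process; struct vm_region; struct page;
    extern struct process *current_process;
    extern struct vm_region *find_region(struct process *, uintptr_t);
    extern bool access_allowed(struct vm_region *, bool write);
    extern struct page *lookup_resident_page(struct vm_region *, uintptr_t);
    extern struct page *alloc_page_frame(void);
    extern void read_from_backing_store(struct vm_region *, uintptr_t, struct page *);
    extern void map_page(struct process *, uintptr_t, struct page *);
    extern void deliver_sigsegv(struct process *, uintptr_t);

    void page_fault_handler(uintptr_t fault_va, bool write_access)
    {
        struct vm_region *region = find_region(current_process, fault_va);

        if (region == NULL || !access_allowed(region, write_access)) {
            /* No region covers this address, or the access kind is not
               permitted: the "invalid page fault" case. */
            deliver_sigsegv(current_process, fault_va);
            return;
        }

        struct page *page = lookup_resident_page(region, fault_va);
        if (page != NULL) {
            /* Contents already in RAM (shared page, page cache, ...):
               a "minor" fault -- just fix up the page table. */
            map_page(current_process, fault_va, page);
        } else {
            /* Contents live in a page file or mapped file on disk:
               a "major" fault -- allocate a frame and do the I/O. */
            page = alloc_page_frame();
            read_from_backing_store(region, fault_va, page);
            map_page(current_process, fault_va, page);
        }
        /* Returning from the exception restarts the faulting instruction. */
    }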

Related

What happens in the kernel when the process accesses an address just allocated with brk/sbrk?

This is actually a theoretical question about memory management. Since different operating systems implement things differently, I'll have to relieve my thirst for knowledge by asking how things work in just one of them :( Preferably the open-source and widely used one: Linux.
Here is the list of things I know in the whole puzzle:
malloc() is user space. libc is responsible for the syscall job (calling brk/sbrk/mmap...). It manages to get big chunks of memory, described by ranges of virtual addresses. The library slices these chunks up and uses them to satisfy the user application's requests.
I know what brk/sbrk syscalls do. I know what 'program break' means. These calls basically push the program break offset. And this is how libc gets its virtual memory chunks.
Now that the user application has a new virtual address to manipulate, it simply writes some value to it, like: *allocated_integer = 5;. OK. Now what? If brk/sbrk only updates offsets in the process' entry in the process table, or whatever, how is the physical memory actually allocated?
I know about virtual memory, page tables, page faults, etc. But I wanna know exactly how these things are related to this situation that I depicted. For example: is the process' page table modified? How? When? A page fault occurs? When? Why? With what purpose? When is this 'buddy algorithm' called, and this free_area data structure accessed? (http://www.tldp.org/LDP/tlk/mm/memory.html, section 3.4.1 Page Allocation)
Well, after finally finding an excellent guide (http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) and some hours digging the Linux kernel, I found the answers...
Indeed, brk only extends the virtual memory area.
When the user application hits *allocated_integer = 5;, a page fault occurs.
The page fault routine will search for the virtual memory area responsible for the address and then call the page table handler.
The page table handler goes through each level (2 levels in x86 and 4 levels in x86_64), allocating entries if they're not present (2nd, 3rd and 4th), and then finally calls the real handler.
The real handler actually calls the function responsible for allocating page frames.
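As a user-space illustration (Linux-specific; sbrk is assumed to be exposed by glibc under _DEFAULT_SOURCE), the following program extends the program break and then performs the first write, which is exactly what triggers the page fault path described above; getrusage shows the resulting minor-fault count:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);

        /* Push the program break by one page: only the VMA grows here;
           no physical page frame has been allocated yet. */
        long page = sysconf(_SC_PAGESIZE);
        int *allocated_integer = sbrk(page);
        if (allocated_integer == (void *)-1)
            return 1;                       /* sbrk failed */

        /* First write: no present PTE, so the kernel's page fault handler
           allocates a frame and maps it, then the write is retried. */
        *allocated_integer = 5;

        getrusage(RUSAGE_SELF, &after);
        printf("minor faults: %ld -> %ld\n", before.ru_minflt, after.ru_minflt);
        return 0;
    }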

Handling MMU translation faults in instruction stream - what happens to MMU?

This question is not specific to any CPU implementation, but CPU-specific answers are welcomed.
I am currently implementing a full MMU-enabled CPU, and a simple issue arose.
So, imagine the situation where a simple TLB miss happens, caused by the instruction stream (or instruction cache). Now, if the PTE is not found, some exception will be triggered, like a "Page Translation Fault". So far, no problem at all.
Now, in order to call the fault handler, the instruction stream (or cache) needs to fetch the exception handler code. For that it will again need to search the TLB for the relevant PTE, and possibly do another table walk.
Imagine that, again, the PTE entry is not found. One would expect some other exception handler to be called.
Now, in this last exception handler, since the handler itself might not be found or be valid, does the MMU get disabled before the handler is fetched and executed (thus bypassing everything the MMU does, including phys-virt mapping), or is there another (non-fatal) technique to deal with this situation?
Alvie
I can't say this with certainty about real-world operating systems, but from my limited experience looking at small kernels, the emphasis always seems to be on ensuring that the page fault handler itself is never paged out and is always in a location that never raises a page fault. This makes sure that the situation described in your problem never arises.
In general, it seems to make sense that some part of the core kernel code resides statically in physical memory with a known mapping; but given that you are writing a full-blown virtual-memory-enabled OS anyway, I guess you already know that.
There are two ways I'm aware of:
The MMU is disabled automatically when an interrupt/exception occurs. So the fault handler (data abort handler) has to be placed at a known physical address, and spurious MMU faults are out of the question. It is the handler's responsibility to re-enable the MMU, either before returning from the exception or for the handler's own use. In real life, that behaviour is quite a pain in the ass...
For example, the 'MicroBlaze' arch does exactly that.
The MMU is not disabled automatically. The trick is to have 2 sets of TLB tables: TLB1 holds the kernel's mapping tables, while TLB0 is for the user apps' mapping tables. Accordingly, the kernel and user apps should be linked so that their virtual addresses do not overlap.
When a user app does some sh** and causes an MMU fault, an exception occurs. The abort/fault handler is in kernel memory space, so the handler code will be accessed through a different TLB. You should be damn sure that the kernel TLB is correct :)
If the kernel exception handler generates an exception itself, then there is a probability of spurious data and/or instruction aborts.
In practice, however, "ARM-Ax" CPUs, for instance, mask exceptions/interrupts when they are taken. I think spurious exceptions do not occur; I've never tested that in practice though.
And well, a HW watchdog might do you a favour...
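For whatever it's worth, here is a very rough C pseudocode sketch of the first technique (handler entered with the MMU off); all register accessors and function names below are invented, not any real ISA's interface:

    #include <stdint.h>

    /* Pseudocode sketch: the CPU enters this handler with the MMU
       disabled, so the handler must live at a known physical address
       and walk the page tables using physical addresses. All names
       below are invented for illustration. */
    extern uintptr_t read_faulting_virtual_address(void);
    extern uintptr_t walk_page_table_physical(uintptr_t va);
    extern void fill_tlb_entry(uintptr_t va, uintptr_t pa);
    extern void enable_mmu_and_return_from_exception(void);

    void tlb_miss_handler(void)   /* placed in an identity-mapped section */
    {
        uintptr_t va = read_faulting_virtual_address();
        uintptr_t pa = walk_page_table_physical(va);
        /* No nested MMU fault is possible here: the MMU is still off. */
        fill_tlb_entry(va, pa);
        enable_mmu_and_return_from_exception();
    }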

How an assembler instruction could not read the memory it is placed at

Using some software on Windows XP that runs as a Windows service, and doing a restart from the logon screen, I see the infamous error message
The instruction at "00x..." referenced memory at "00x...". The memory
could not be read.
I reported the problem to the developers, but looking at the message once again, I noticed that the addresses are the same. So
The instruction at "00xdf3251" referenced memory at "00xdf3251". The memory
could not be read.
Whether or not this is a bug in the program, what is the state of the memory/access rights (or something else) that prevents an instruction from reading the memory it is placed at? Is it something specific to services?
I would guess there was an attempt to execute an instruction at the address 0xdf3251 and that location wasn't backed by a readable and executable page of memory (perhaps it was completely unmapped).
If that's the case, the exception (page fault, in fact) originates from that instruction and the exception handler has its address on the stack (the location to return to, in case the exception can be somehow resolved and the faulting instruction restarted when the handler returns). And that's the first address you're seeing.
The CR2 register that the page fault handler reads, which is the second address you're seeing, also has the same address because it has to contain the address of an inaccessible memory location irrespective of whether the page fault has been caused by:
complete absence of mapping (there's no page mapped at all)
lack of write permission (the page is read-only)
lack of execute permission (the page has the no-execute bit set) OR
lack of kernel privilege (the page is marked as accessible only in the kernel)
and irrespective of whether it was during a data access or while fetching an instruction (the latter being our case).
That's how you can get the instruction and memory access addresses equal.
Most likely the code had a bug resulting in a memory corruption and some pointer (or a return address on the stack) was overwritten with a bogus value pointing to an inaccessible memory location. And then one way or the other the CPU was directed to continue execution there (most likely using one of these instructions: jmp, call, ret). There's also a chance of having a race condition somewhere.
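A minimal way to reproduce the "instruction address equals referenced address" symptom is to call through a garbage function pointer (deliberately crashing demo; the hex value is only illustrative):

    #include <stdint.h>

    int main(void)
    {
        /* The call transfers control to an unmapped address, so the CPU
           faults while FETCHING an instruction there: the reported
           instruction address and referenced address coincide. */
        void (*bogus)(void) = (void (*)(void))(uintptr_t)0xdf3251;
        bogus();
        return 0;
    }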
This kind of crash is most typically caused by stack corruption. A very common kind is a stack buffer overflow. Write too much data in an array stored on the stack and it overwrites a function's return address with the data. When the function then returns, it jumps to the bogus return address and the program falls over because there's no code at the address. They'll have a hard time fixing the bug since there's no easy way to find out where the corruption occurred.
This is a rather infamous kind of bug; it is a major attack vector for malware, since it can commandeer a program to jump to arbitrary code supplied as data. You ought to have a sit-down with these devs and point this out; it is a major security risk. The cure is easy enough: they should update their tools. Countermeasures against buffer overflows are built into compilers these days.
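For reference, the classic pattern looks something like this (deliberately buggy demo; modern compilers' stack protectors will usually abort it with a clearer message):

    #include <string.h>

    /* Deliberately buggy: writing past 'buf' clobbers the saved return
       address on the stack, so the 'ret' at the end of copy_name jumps
       to garbage bytes instead of back to the caller. */
    static void copy_name(const char *input)
    {
        char buf[16];
        strcpy(buf, input);   /* no bounds check: overflows for long input */
    }

    int main(void)
    {
        copy_name("this string is much longer than sixteen bytes");
        return 0;             /* likely never reached */
    }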

Why there is no SIGSEGV signal on copy on write?

The copy-on-write article on Wikipedia says that copy-on-write is usually implemented by giving read-only access to the pages, so that when one is written to, the page fault trap handler can map a unique physical memory page for it. So my question is why a user-level application doesn't receive a SIGSEGV signal when such a page fault happens. After all, the Wikipedia article on SIGSEGV says that SIGSEGV is the signal sent to a process when it makes an invalid memory reference, or segmentation fault. So in this case, that is, in the copy-on-write case, why is no SIGSEGV sent to the process?
I know it's been a while since this was asked, but I wanted to expand on Alexey's answer a bit.
Copy-on-write (I assume you're talking about virtual memory and not filesystems) usually works like so:
The OS knows which pages need to be copied on write. (They are the pages which are private to a process.) These pages are marked in hardware as read-only. However, the virtual memory map of the process has the pages marked as readable and writable. This means that the user process believes it has full access to the pages in question.
When a user process attempts to write to one of these pages, a page fault is generated because the processor recognizes that the page is read-only (based on the hardware marking described above). Page faults are sort of like segfaults, but for the kernel instead of for user processes.
This triggers the page fault handler to run within the kernel, which looks at the page in question and sees that it's a private page which has not yet been copied. The handler will create a copy of the page and mark the copy as writable.
Then the handler will replace the old page's address with the new one in the virtual-to-physical translation table and exit.
The faulting instruction will be retried by the user process at this point, and this time the write will succeed because the new page is writable at both the virtual memory map (the user process' view of memory permissions) and hardware (the kernel's view of memory permissions) levels.
A page fault is generated every time a segmentation fault occurs, but most page faults are handled by the kernel and are never passed to the process that caused them as segfaults. There are many reasons why a page fault might be handled at a lower level, including:
The page which was accessed was paged out to disk because it hadn't been used in a long time. The OS must bring it back into memory so the process can use it again.
The process is accessing a newly-allocated page for the first time, and the actual physical page hasn't been allocated yet. The OS must allocate a page and then insert it into the virtual-to-physical translation table before the memory can actually be used.
The OS is playing a hardware page access permissions trick to allow it to watch for accesses to a particular page. This is what happens in copy-on-write, but it can have other uses as well. Consider an OS-level virtualization technology like kvm, where writing to a memory-mapped device's location in memory in the guest OS should actually write to a file or the display in the host OS.
The main idea of COW is that COW is completely transparent to the user process as if it fully owned the memory without any sharing.
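A small POSIX demo of that transparency (assuming Linux-style rusage accounting): after fork, parent and child share the buffer copy-on-write, and the child's first writes are resolved silently as minor faults, with no SIGSEGV ever delivered:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1 << 20;              /* 1 MiB */
        char *buf = malloc(len);
        memset(buf, 'a', len);             /* fault pages in before forking */

        pid_t pid = fork();                /* pages now shared copy-on-write */
        if (pid == 0) {
            struct rusage before, after;
            getrusage(RUSAGE_SELF, &before);
            memset(buf, 'b', len);         /* writes hit read-only COW pages */
            getrusage(RUSAGE_SELF, &after);
            printf("child minor faults during write: %ld\n",
                   after.ru_minflt - before.ru_minflt);
            _exit(0);                      /* no SIGSEGV was ever raised */
        }
        wait(NULL);
        free(buf);
        return 0;
    }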

Allocating a buffer of more a page size on stack will corrupt memory?

In Windows, the stack is implemented as follows: the committed stack pages are followed by a special page whose protection flag marks it as a guard page. When a thread references an address on the guard page, a memory fault is raised, which makes the memory manager commit the guard page to the stack and clear the page's guard flag, and then reserve a new page as the guard.
However, when I allocate a buffer whose size is more than one page (4KB), the expected error doesn't happen. Why?
Excellent question (+1).
There's a trick, and few people know about it (besides driver writers).
When you allocate a large buffer on the stack, the compiler automatically adds so-called stack probes: extra code (usually implemented in the CRT) that probes the allocated region page by page, in the required order.
EDIT:
The function is _chkstk.
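For example (Windows/MSVC assumed; the probe function's name varies by toolchain):

    #include <string.h>

    /* With a local buffer larger than one page, MSVC emits a call to
       _chkstk in this function's prologue. _chkstk touches the new stack
       region one page at a time, so each guard page is hit in order and
       the OS can keep growing the stack. */
    static void big_local(void)
    {
        char buf[16 * 4096];        /* 64 KB: spans many 4 KB pages */
        memset(buf, 0, sizeof buf);
    }

    int main(void)
    {
        big_local();
        return 0;
    }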
The fault doesn't reach your program; it is handled by the operating system. A similar thing happens when your program tries to read memory that happens to have been written out to the swap file: a trap occurs, the operating system swaps the page back in, and your program continues.
