Handling MMU translation faults in instruction stream - what happens to MMU? - caching

This question is not specific to any CPU implementation, but CPU-specific answers are welcomed.
I am currently implementing a full MMU-enabled CPU, and a simple issue arose.
So, imagine the situation where a simple TLB miss happens caused by the instruction stream (or instruction cache). This would trigger a TLB miss. Now, if the PTE is not found, some exception will be triggered, like a "Page Translation Fault". So far, no problem at all.
Now, in order to call the fault handler, the instruction stream (or cache) needs to fetch the exception handler code. For that it will need to search again for the relevant PTE entry in TLB, and eventually another table walk.
Imagine that, again, the PTE entry is not found. One would expect some other exception handler to be called.
Now, on this last exception handler, since the handler itself might not be found or be valid, does MMU gets disabled before the handler is fetched and executed (thus bypassing everyting MMU does, including Phys-Virt mapping), or is there another technique (non-fatal) to deal with this situation ?
Alvie

I can't say this with certainty about real world operating system, but from the little experience in looking at small kernels, the emphasis always seems to be in ensuring that the page fault handler by itself is never paged out and is always in a location that never raises a page fault. This would make sure that a situation as described in your problem never arises.
In general, it seems to make sense that some part of the core kernel code resides statically on the physical memory with known mapping; but given that you were anyway trying to write a full blown virtual memory enabled OS, I guess you would know be knowing that.

There are two ways I'm aware of:
MMU is disabled automatically when interrupt/exception occur. So fault handler (data abort handler) has to be placed at known physical address and spurious MMU faults are out of question. That's a responsibility of a handler to reenable MMU before returning from an exception or for handler usage itself. That behaviour, in real life, quite a pain in an ass...
For example 'Microblaze' arch does exactly that.
MMU is not disabled automatically. The trick is to have 2 set of TLB tables. TLB1 has kernel mapping tables, TLB0 is made for an user apps mapping tables. Respectively kernel & user apps should have appropriate linkage to exclude the overlapping of virtual addresses between each other.
When user app does a sh** and cause a MMU fault, exception occurs. Abort/fault handler is in kernel memory space so handler code will be accessed with different TLB. You should be damn sure that kernel TLB is correct :)
If kernel exception handler generates exception itself then there is a probability of spurious data and/or instruction aborts.
In practice however, "ARM-Ax" CPUs, for instance, mask exceptions/interrupts when they are taken. I think spurious exceptions do not occur, I've never tested that in practice though.
And well HW watchdog might give you a favour...

Related

How to access physical address during interrupt handler linux

I wrote an interrupt handler in linux.
Part of the handler logic I need to access physical address.
I used iormap function but then I fell into KDB during handler time.
I started to debug it and i saw the below code which finally called by ioremap
What should I do? Is there any other way instead of map the region before?
If i will need to map it before it means that i will probably need to map and cache a lot of unused area.
BTW what are the limits for ioremap?
Setting up a new memory mapping is an expensive operation, which typically requires calls to potentially blocking functions (e.g. grabbing locks). So your strategy has two problems:
Calling a blocking function is not possible in your context (there is no kernel thread associated with your interrupt handler, so there is no way for the kernel to resume it if it had to be put to sleep).
Setting up/tearing down a mapping per IRQ would be a bad idea performance-wise (even if we ignore the fact that it can't be done).
Typically, you would setup any mappings you need in a driver's probe() function (or in the module's init() if it's more of a singleton thing). This mapping is then kept in some private device data structure, which is passed as the last argument to some variant of request_irq(), so that the kernel then passes it back as the second argument to the IRQ handler.
Not sure what you mean by "need to map and cache a lot of unused area".
Depending on your particular system, you may end up consuming an entry in your CPU's MMU, or you may just re-use a broader mapping that was setup by whoever wrote the BSP. That's just the cost of doing business on a virtual memory system.
Caching is typically not enabled on I/O memory because of the many side-effects of both reads and writes. For the odd cases when you need it, you have to use ioremap_cached().

What happens in the kernel when the process accesses an address just allocated with brk/sbrk?

This is actually a theoretical question about memory management. Since different operating systems implement things differently, I'll have to relieve my thirst for knowledge asking how things work in only one of them :( Preferably the open source and widely used one: Linux.
Here is the list of things I know in the whole puzzle:
malloc() is user space. libc is responsible for the syscall job (calling brk/sbrk/mmap...). It manages to get big chunks of memory, described by ranges of virtual addresses. The library slices these chunks and manages to respond the user application requests.
I know what brk/sbrk syscalls do. I know what 'program break' means. These calls basically push the program break offset. And this is how libc gets its virtual memory chunks.
Now that user application has a new virtual address to manipulate, it simply writes some value to it. Like: *allocated_integer = 5;. Ok. Now, what? If brk/sbrk only updates offsets in the process' entry in the process table, or whatever, how the physical memory is actually allocated?
I know about virtual memory, page tables, page faults, etc. But I wanna know exactly how these things are related to this situation that I depicted. For example: is the process' page table modified? How? When? A page fault occurs? When? Why? With what purpose? When is this 'buddy algorithm' called, and this free_area data structure accessed? (http://www.tldp.org/LDP/tlk/mm/memory.html, section 3.4.1 Page Allocation)
Well, after finally finding an excellent guide (http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) and some hours digging the Linux kernel, I found the answers...
Indeed, brk only pushes the virtual memory area.
When the user application hits *allocated_integer = 5;, a page fault occurs.
The page fault routine will search for the virtual memory area responsible for the address and then call the page table handler.
The page table handler goes through each level (2 levels in x86 and 4 levels in x86_64), allocating entries if they're not present (2nd, 3rd and 4th), and then finally calls the real handler.
The real handler actually calls the function responsible for allocating page frames.

What is the behavior of MMU in case a page fault is not handled?

I was going through the do_page_fault (x86 arch) routine. Suppose a process tries to write to a shared page which is swapped out. Then as per the execution flow in do_page_fault, if the access is valid and it is a normal page (not huge page) and the execution lets say came till do_swap_page (i.e., no errors). Once do_swap_page is executed, it returns.
1) But will there be a fault again in case swap-in itself was not handled due to some reason?
2) In general, I would like to know more detail about MMU like - does it check pte flags or vm area flags to raise fault on an address? Can anyone point me to the sources where I can understand how MMU does the checks for a memory access.
1) But will there be a fault again in case swap-in itself was not handled due to some reason?
Yes. Fault will be generated again and again (ISR completes successfully), till page is in place. MMU doesn't track whether previous access to this page generated interrupt or not.
However, if page fault is triggered when previous fault is handled, double fault will be triggered.
2) In general, I would like to know more detail about MMU like - does it check pte flags or vm area flags to raise fault on an address? Can anyone point me to the sources where I can understand how MMU does the checks for a memory access.
Yes, it checks.
You may check OSDev for more info.

How an assembler instruction could not read the memory it is placed at

Using some software in Windows XP that works as a Windows service and doing a restart from the logon screen I see an infamous error message
The instruction at "00x..." referenced memory at "00x...". The memory
could not be read.
I reported the problem to the developers, but looking at the message once again, I noticed that the addresses are the same. So
The instruction at "00xdf3251" referenced memory at "00xdf3251". The memory
could not be read.
Whether this is a bug in the program or not, but what is the state of the memory/access rights or something else that prevents an instruction from reading the memory it is placed. Is it something specific to services?
I would guess there was an attempt to execute an instruction at the address 0xdf3251 and that location wasn't backed up by a readable and executable page of memory (perhaps, completely unmapped).
If that's the case, the exception (page fault, in fact) originates from that instruction and the exception handler has its address on the stack (the location to return to, in case the exception can be somehow resolved and the faulting instruction restarted when the handler returns). And that's the first address you're seeing.
The CR2 register that the page fault handler reads, which is the second address you're seeing, also has the same address because it has to contain the address of an inaccessible memory location irrespective of whether the page fault has been caused by:
complete absence of mapping (there's no page mapped at all)
lack of write permission (the page is read-only)
lack of execute permission (the page has the no-execute bit set) OR
lack of kernel privilege (the page is marked as accessible only in the kernel)
and irrespective of whether it was during a data access or while fetching an instruction (the latter being our case).
That's how you can get the instruction and memory access addresses equal.
Most likely the code had a bug resulting in a memory corruption and some pointer (or a return address on the stack) was overwritten with a bogus value pointing to an inaccessible memory location. And then one way or the other the CPU was directed to continue execution there (most likely using one of these instructions: jmp, call, ret). There's also a chance of having a race condition somewhere.
This kind of crash is most typically caused by stack corruption. A very common kind is a stack buffer overflow. Write too much data in an array stored on the stack and it overwrites a function's return address with the data. When the function then returns, it jumps to the bogus return address and the program falls over because there's no code at the address. They'll have a hard time fixing the bug since there's no easy way to find out where the corruption occurred.
This is a rather infamous kind of bug, it is a major attack vector for malware. Since it can commandeer a program to jump to arbitrary code with data. You ought to have a sitdown with these devs and point this out, it is a major security risk. The cure is easy enough, they should update their tools. Countermeasures against buffer overflow are built into the compilers these days.

Why there is no SIGSEGV signal on copy on write?

The copy-on-write article on wikipedia says that copy-on-write is usually implemented by giving read only access to the pages, so that when one is written, the page fault trap handler can map a unique physical memory page for it. So my question is why a user-level application doesn't receive a SIGSEGV signal when such page fault happens? Afterall, the wikipedia article on SIGSEGV says that SIGSEGV is the signal sent to a process when it makes an invalid memory reference, or segmentation fault. So in this case, that is on copy-on-write case, why no SIGSEGV is sent to the process.
I know it's been a while since this was asked, but I wanted to expand on Alexey's answer a bit.
Copy-on-write (I assume you're talking about virtual memory and not filesystems) usually works like so:
The OS knows which pages need to be copied on write. (They are the pages which are private to a process.) These pages are marked in hardware as read-only. However, the virtual memory map of the process has the pages marked as readable and writable. This means that the user process believes it has full access to the pages in question.
When a user process attempts to write to one of these pages, a page fault is generated because the processor recognizes that the page is read-only (based on the hardware marks before). Page faults are sort of like segfaults, but for the kernel instead of for user processes.
This triggers the page fault handler to run within the kernel, which looks at the page in question and sees that it's a private page which has not yet been copied. The handler will create a copy of the page and mark the copy as writable.
Then the handler will replace the old page's address with the new one in the virtual-to-physical translation table and exit.
The last instruction will be retried by the user process at this point, and this time the write will succeed because the new page is writeable at both the virtual memory map (the user process' view of memory permissions) and hardware (the kernel's view of memory permissions) levels.
A page fault is generated every time a segmentation fault occurs, but most page faults are handled by the kernel and are never passed to the process that caused them as segfaults. There are many reasons why a page fault might be handled at a lower level, including:
The page which was accessed was paged out to disk because it hadn't been used in a long time. The OS must bring it back into memory so the process can use it again.
The process is accessing a newly-allocated page for the first time, and the actual physical page hasn't been allocated yet. The OS must allocate a page and then insert it into the virtual-to-physical translation table before the memory can actually be used.
The OS is playing a hardware page access permissions trick to allow it to watch for accesses to a particular page. This is what happens in copy-on-write, but it can have other uses as well. Consider an OS-level virtualization technology like kvm, where writing to a memory-mapped device's location in memory in the guest OS should actually write to a file or the display in the host OS.
The main idea of COW is that COW is completely transparent to the user process as if it fully owned the memory without any sharing.

Resources