How do KVM/QEMU and the guest OS handle page faults - linux-kernel

For example, I have a host OS (say, Ubuntu) with KVM enabled. I start a virtual machine with QEMU to run a guest OS (say, CentOS). It is said that to the host OS, this VM is just a process. So from the host's point of view, it handles page faults as usual (e.g., allocating page frames as needed, swapping pages based on active/inactive lists if necessary).
Here is the question and my understanding. Within the guest OS, as it is still a full-fledged OS, I assume it still has all the mechanisms for handling virtual memory. It sees some virtualized physical memory provided by QEMU. By virtualized physical memory I mean that the guest OS doesn't know it is in a VM and still works as it would on a real physical machine, but what it has is really an abstraction provided by QEMU. So even if a page frame has been allocated to it, if that frame is not in the guest's page table, the guest OS will still trigger a page fault and then map some page to the frame. What's worse, there may be a double page fault: the guest first allocates some page frames upon a page fault, which in turn triggers a page fault in the host OS.
However, I have also heard of something called shadow page tables, which seems to optimize away this unnecessary double page fault and double page table issue. I also looked at some other kernel implementations, specifically unikernels, e.g., OSv, IncludeOS, etc. I didn't find anything related to page fault and page table mechanisms. I did see some symbols like page_fault_handler, but nothing as extensive as what I saw in the Linux kernel code. It seems memory management is not a big deal in these unikernel implementations, so I assume QEMU/KVM and some of Intel's virtualization technologies have already handled that.
Any ideas on this topic? Good references, papers, resources, or hints would be very helpful.

There are two ways for QEMU/KVM to support guest physical memory: EPT and shadow page tables. (EPT is an Intel-defined mechanism. Other processors support something similar, which I won't talk about here.)
EPT stands for Extended Page Tables. It is a second level of paging supported by the CPU in addition to the regular processor page tables. While running in a VM, the regular page tables are used to translate Guest Virtual Addresses into Guest Physical Addresses, while the EPT tables are used to translate Guest Physical Addresses into Host Physical Addresses. This double-level translation is performed for every memory access within the guest. (The processor TLBs hide most of the cost.) EPT tables are managed by the VMM while the regular page tables are managed by the guest. If a page is not present in the guest page tables, it causes a page fault within the guest, exactly as you have described. If a page is present in the guest page tables but not present in the EPT, it causes an EPT violation VM exit, so the VMM can handle the missing page.
Shadow page tables are used when EPT is not available. Shadow page tables are a copy of the guest page tables which incorporate both the GVA to GPA and GPA to HPA mappings within a single set of page tables. When a page fault occurs, it always causes a VM exit. The VMM checks whether the missing page is mapped in the guest page tables. If it is not, then the VMM injects the page fault into the guest for it to handle. If the page is mapped in the guest page tables, then the VMM handles the fault as it would for an EPT violation. Efficient management of shadow page tables across multiple processes within the guest can be very complex.
EPT is simpler to implement and has far better performance for most workloads, because page faults are delivered directly to the guest OS, which is generally where they need to be handled. Shadow page tables require a VM exit for every page fault. However, shadow page tables may perform better for a few specific workloads that cause very few page faults.
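To make the two schemes concrete, here is a toy model in C (not KVM/QEMU code; the table sizes and names are made up) showing how the two translations compose and where each kind of fault surfaces:

```c
/*
 * Toy model of the two schemes above. Page "tables" are flat arrays
 * indexed by page number; real hardware walks multi-level radix trees,
 * but the composition and fault classification are the same.
 */
#include <stdio.h>

#define PAGES       16
#define NOT_PRESENT -1

static int guest_pt[PAGES];  /* GVA page -> GPA page (managed by the guest)  */
static int ept[PAGES];       /* GPA page -> HPA page (managed by the VMM)    */
static int shadow_pt[PAGES]; /* GVA page -> HPA page (VMM-built composition) */

/* EPT mode: the CPU walks both tables on every guest memory access. */
static int translate_ept(int gva_page)
{
    int gpa = guest_pt[gva_page];
    if (gpa == NOT_PRESENT) {
        printf("  page fault -> handled inside the guest\n");
        return NOT_PRESENT;
    }
    int hpa = ept[gpa];
    if (hpa == NOT_PRESENT) {
        printf("  EPT violation -> VM exit, VMM maps the page\n");
        return NOT_PRESENT;
    }
    return hpa;
}

/* Shadow mode: the VMM keeps shadow_pt = ept o guest_pt up to date.
 * Every page fault causes a VM exit so the VMM can classify it. */
static int translate_shadow(int gva_page)
{
    int hpa = shadow_pt[gva_page];
    if (hpa != NOT_PRESENT)
        return hpa;

    if (guest_pt[gva_page] == NOT_PRESENT) {
        printf("  VM exit -> inject the page fault into the guest\n");
    } else {
        printf("  VM exit -> VMM resolves GPA->HPA, refills the shadow entry\n");
        shadow_pt[gva_page] = ept[guest_pt[gva_page]];
    }
    return NOT_PRESENT;
}

int main(void)
{
    for (int i = 0; i < PAGES; i++)
        guest_pt[i] = ept[i] = shadow_pt[i] = NOT_PRESENT;

    guest_pt[1] = 7;  /* guest maps GVA page 1 to GPA page 7  */
    ept[7] = 3;       /* VMM backs GPA page 7 with HPA page 3 */

    printf("EPT:    GVA 1 -> HPA %d\n", translate_ept(1));
    printf("shadow: GVA 1 -> HPA %d (first access faults)\n", translate_shadow(1));
    printf("shadow: GVA 1 -> HPA %d (retry hits)\n", translate_shadow(1));
    return 0;
}
```

Real hardware caches the combined translation in the TLB, but the fault classification shown here is the same one the VMM performs.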

Related

Why does Linux KASLR also randomize physical addresses?

This looks strange, because after the MMU is enabled we operate with virtual addresses and don't use physical addresses directly.
I suppose that it is a hardening of the kernel.
Suppose that an attacker is able to corrupt a PTE.
If the physical location of the kernel is always known, then the attacker can immediately remap the page onto a suitable physical location and get code execution as a privileged user.
I think 'protection from DMA-capable devices' is not a valid answer.
If a malicious DMA-capable device has access to all of physical memory (e.g., no protection through an IOMMU/IOTLB), then the device can scrape memory and immediately find where the kernel is located in physical memory.
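To illustrate why the physical address matters, here is a minimal sketch of the x86-64 PTE format (the kernel physical base below is hypothetical; this is an illustration of the encoding, not exploit code):

```c
/*
 * How an x86-64 page-table entry (PTE) encodes a physical frame.
 * If the kernel's physical load address were fixed and known, an
 * attacker who can overwrite one PTE could point a user-accessible
 * virtual page at kernel memory with a value like the one built here.
 */
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_PFN_MASK  0x000FFFFFFFFFF000ULL   /* bits 12..51 = frame number */

static uint64_t make_pte(uint64_t phys_addr, uint64_t flags)
{
    return (phys_addr & PTE_PFN_MASK) | flags;
}

int main(void)
{
    /* Hypothetical, well-known kernel physical base (no physical KASLR). */
    uint64_t kernel_phys = 0x1000000;  /* 16 MiB, a traditional default */

    uint64_t evil_pte = make_pte(kernel_phys,
                                 PTE_PRESENT | PTE_WRITABLE | PTE_USER);
    printf("PTE pointing a user page at kernel memory: %#018llx\n",
           (unsigned long long)evil_pte);

    /* With the physical base randomized, kernel_phys is unknown,
     * so this value cannot be constructed reliably. */
    return 0;
}
```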

Can LXC be secure enough for IaaS?

I found in the Debian Handbook some isolation limits of LXC.
Those limits concern:
Memory isolation
Shared filesystems
Kernel messages
Kernel compromise possibilities
For memory isolation and shared filesystems, this does not seem to be a problem, because it's possible to configure containers to isolate them. But is there a way to secure the kernel enough to ensure an untrusted user can't compromise it and can't read kernel messages?
If it's possible, is such restricted user access too constraining for an IaaS? Or is it better to use full virtualization or para-virtualization to offer IaaS solutions?
All the Linux containers still run under one kernel. If that kernel is compromised, then since it runs in the most privileged hardware mode (ring 0 on x86), the compromise can affect every running container. With traditional hardware virtualization, even if one guest kernel is compromised, the hypervisor exists in another ring of protection (again, x86 terminology) to isolate the virtual guests. It is of course possible to compromise the hypervisor if there is an error in its implementation, but compromising a virtual machine will not directly affect the other guests.
Also, a compromised guest could indirectly affect the other guests via the (virtualized) network, i.e., by sending malicious messages, but that is analogous to one machine in a network being compromised and attacking another machine, without virtualization. Furthermore, a compromised guest could degrade the performance of the other machines via micro-architectural elements, e.g., by thrashing the cache, or use those micro-architectural elements as a side channel to glean information about another virtual machine.

How is a page fault triggered in the Linux kernel?

I understand the Linux kernel implements demand paging - a page is not allocated until it is first accessed. This is all handled in the page fault handler. But what I don't understand is how the page fault is triggered. More precisely, what triggers the call of the page fault handler? Is it the hardware?
The page fault is raised by the CPU (more specifically, the MMU) whenever the application tries to access a virtual memory address which isn't mapped to a physical address. The page fault handler (in the kernel) then checks whether the page is currently swapped to disk (and swaps it back in) or has been reserved but not committed (and commits it), then returns control to the application to retry the memory access instruction. If, on the other hand, the application doesn't have that virtual address allocated, the kernel sends a segmentation fault signal (SIGSEGV) to the application.
So it's most accurate to say that the hardware triggers the call.
The other case is when a mapping points to memory that does not exist at all (virtual to physical). The MMU reports that there is no corresponding physical memory and informs the operating system; this is known as a "page fault". The operating system determines that the page is a less-used piece of virtual memory and looks for it on disk. The page the MMU was trying to find is then loaded back into a page frame, and the page table and memory map are updated accordingly. Control is then given back to the user application at the exact point where the page fault occurred and the instruction is executed again; this time the MMU outputs the correct address to the memory system, and execution continues.
Since the page fault is triggered by the MMU, which is part of the hardware, the hardware is responsible for it.
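For a concrete view from user space, here is a small demo (Linux-specific, illustrative only) that maps anonymous memory and counts the minor page faults taken when the pages are first touched:

```c
/*
 * Minimal demo of demand paging, observable from user space.
 * mmap() only reserves virtual address space; each page is materialized
 * by a (minor) page fault on first touch. Fault counts come from
 * getrusage(), so exact numbers vary by system.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    const size_t npages = 256;
    const size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();
    memset(p, 0xAA, len);              /* first touch: faults per page */
    long after = minor_faults();

    printf("minor page faults while touching %zu pages: %ld\n",
           npages, after - before);
    munmap(p, len);
    return 0;
}
```

The count may be lower than the number of pages on kernels that do fault-around, but it makes the demand-paging behaviour visible.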

Can I allocate memory pages at a specified physical address in a kernel module?

I am writing a kernel module in a guest operating system that will be run in a virtual machine using KVM. Here I want to allocate a memory page at a particular physical address. kmalloc() gives me memory, but at a physical address chosen by the OS.
Background: I am writing a device emulation technique in QEMU that wouldn't cause a VM exit when the guest communicates with the device (an exit occurs, for example, for both memory-mapped and port-mapped I/O devices). The basic idea is as follows: the guest device driver will write to a specific (guest) physical memory address. A thread in the QEMU process will poll it continuously to check for new data (through some status bits, etc.) and will take action accordingly without causing an exit. Since there is no (existing) way for the guest to tell the host which address is being used by the device driver, I want a pre-specified memory page to be allocated for it.
You cannot allocate memory at a specific address; however, you can reserve certain physical addresses at boot time using reserve_bootmem(). Calling reserve_bootmem() early during boot (it requires a modified kernel, of course) ensures that the reserved memory is not handed to the buddy system (i.e. alloc_pages() and higher-level friends such as kmalloc()), and you will be able to use that memory for any purpose.
It sounds like you should be attacking this from the other side, by having a physical memory range reserved in the memory map that the QEMU BIOS passes to the guest kernel at boot.
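As a rough sketch of the guest side, assuming the page has already been kept out of the page allocator by one of the approaches above (the physical address below is hypothetical and must match whatever address the QEMU thread polls):

```c
/*
 * Sketch only: mapping a pre-reserved guest-physical page from a module.
 * Assumes the page at RESERVED_PHYS was kept out of the allocator, e.g.
 * via reserve_bootmem() or a reservation in the BIOS-provided memory map,
 * and that the host/QEMU side knows the same address.
 */
#include <linux/module.h>
#include <linux/io.h>

#define RESERVED_PHYS 0x10000000UL   /* hypothetical reserved physical page */
#define RESERVED_SIZE 0x1000UL

static void __iomem *shared;

static int __init resmap_init(void)
{
    shared = ioremap(RESERVED_PHYS, RESERVED_SIZE);
    if (!shared)
        return -ENOMEM;

    /* Guest-side "doorbell": the polling QEMU thread watches this word. */
    iowrite32(0x1, shared);
    return 0;
}

static void __exit resmap_exit(void)
{
    iounmap(shared);
}

module_init(resmap_init);
module_exit(resmap_exit);
MODULE_LICENSE("GPL");
```

Depending on how the range is reserved and on the kernel version, memremap() may be more appropriate than ioremap() for RAM-backed ranges; treat this only as a starting point.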

Windows processes in kernel vs system

I have a few questions related to Windows processes in kernel and usermode.
If I have a hello world application, and a hello world driver that exposes a new system call, foo(), I am curious about what I can and can't do once I am in kernel mode.
For starters, when I write my new hello world app, I am given a new process, which means I have my own user-mode VM space (let's keep it simple: 32-bit Windows). So I have 2GB of space that I "own", and I can poke and peek to my heart's content. However, I am bound by my process. I can't (let's not bring shared memory into this yet) touch anyone else's memory.
If I write this hello world driver and call it from my user app, I (the driver code) am now in kernel mode.
First clarification/questions:
I am STILL in the same process as the user mode app, correct? Still have the same PID?
Memory Questions:
Memory is presented to my process as VM; that is, even if I have 1GB of RAM, I can still address 4GB of memory (2GB user / 2GB kernel - never mind the details of the boot switches used on servers, just a general assumption here).
As a user process, I cannot peek at any kernel mode memory address, but I can do whatever I want to the user space, correct?
If I call into my hello world driver, from the driver code, do I still have the same view of the usermode memory? But now I also have access to any memory in kernel mode?
Is this kernel mode memory SHARED (unlike user mode, which is my own process's copy)? That is, writing a driver is more like writing a threaded application for a single process that is the OS (scheduling aside)?
Next question: as a driver, could I change the process that I am running in? Say I knew about another app (say, a user-mode web server); could I load the VM for that process, change its instruction pointer or stack, or even load different code into the process, and then switch back to my own app? (I am not trying to do anything nefarious here; I am just curious what it really means to be in kernel mode.)
Also, once in kernel mode, can I prevent the OS from preempting me? I think (in Windows) you can set your IRQL level to do this, but I don't fully understand this, even after reading Solomon's book (Inside Windows...). I will ask another question directly related to IRQL/DPCs, but for now I would love to know whether a kernel driver has the power to raise the IRQL to HIGH_LEVEL and take over the system.
More to come, but answers to these questions would help.
Each process has a "context" that, among other things, contains the VM mappings specific to that process (<2 GB normally in 32-bit mode). When a thread executing in user mode enters kernel mode (e.g. from a system call or I/O request), the same thread is still executing, in the same process, with the same context. PsGetCurrentProcessId will return the same thing at this point as GetCurrentProcessId would have just before in user mode (same with thread IDs).
The user memory mappings that came with the context are still in place upon entering kernel mode: you can access user memory from kernel mode directly. There are special things that need to be done for this to be safe though: Using Neither Buffered Nor Direct I/O. In particular, an invalid address access attempt in the user space range will raise a SEH exception that needs to be caught, and the contents of user memory can change at any time due to the action of another thread in that process. Accessing an invalid address in the kernel address range causes a bugcheck. A thread executing in user mode cannot access any kernel memory.
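A minimal sketch of that guarded access pattern (the user buffer pointer and length are assumed to come from an IRP; this is not a complete driver):

```c
/*
 * Sketch of the "neither buffered nor direct I/O" pattern: touching a
 * user-mode address from kernel mode must be probed and guarded by SEH.
 */
#include <ntddk.h>

NTSTATUS ReadUserValue(PVOID UserBuffer, SIZE_T Length, ULONG *Out)
{
    if (Length < sizeof(ULONG))
        return STATUS_BUFFER_TOO_SMALL;

    __try {
        /* Raises an exception if the range is not valid user memory. */
        ProbeForRead(UserBuffer, Length, sizeof(ULONG));
        *Out = *(volatile ULONG *)UserBuffer;  /* contents may change at any time */
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        return GetExceptionCode();             /* bad user address, etc. */
    }
    return STATUS_SUCCESS;
}
```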
Kernel address space is not part of a process's context, so is mapped the same between all of them. However, any number of threads may be active in kernel mode at any one time, so it is not like a single threaded application. In general, threads service their own system calls upon entering kernel mode (as opposed to having dedicated kernel worker threads to handle all requests).
The underlying structures that save thread and process state are all available in kernel mode. Mapping the VM of another process is best done ahead of time from within that process, by creating an MDL there and mapping it into system address space. If you just want to alter the context of another thread, that can be done entirely from user mode. Note that a thread must be suspended before changing its context to avoid a race condition. Loading a module into a process from kernel mode is ill advised; all of the loader APIs are designed for use from user mode only.
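For the user-mode route mentioned above, a sketch using the documented Win32 context APIs (the target thread ID is assumed to be known, and error handling is kept minimal):

```c
/*
 * Altering another thread's context entirely from user mode.
 * The thread is suspended around the change to avoid a race.
 */
#include <windows.h>

BOOL RedirectThread(DWORD threadId, ULONG_PTR newIp)
{
    HANDLE h = OpenThread(THREAD_GET_CONTEXT | THREAD_SET_CONTEXT |
                          THREAD_SUSPEND_RESUME, FALSE, threadId);
    if (!h)
        return FALSE;

    BOOL ok = FALSE;
    if (SuspendThread(h) != (DWORD)-1) {
        CONTEXT ctx = { 0 };
        ctx.ContextFlags = CONTEXT_CONTROL;    /* instruction pointer, stack, flags */
        if (GetThreadContext(h, &ctx)) {
#ifdef _WIN64
            ctx.Rip = newIp;
#else
            ctx.Eip = (DWORD)newIp;
#endif
            ok = SetThreadContext(h, &ctx);
        }
        ResumeThread(h);
    }
    CloseHandle(h);
    return ok;
}
```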
Each CPU has a current IRQL that it is running at. It determines what things can interrupt what the CPU is currently doing. Only an event from a higher IRQL can preempt the CPU's current activity.
PASSIVE_LEVEL is where all user code and most kernel code executes. Many kernel APIs require the IRQL to be PASSIVE_LEVEL.
APC_LEVEL is used for kernel APCs.
DISPATCH_LEVEL is for scheduler events (known as the dispatcher in NT terminology). Running at this level will prevent you from being preempted by the scheduler. Note that it is not safe to have any kind of page fault at this level; there would be a deadlock possibility with the memory manager trying to retrieve pages. The kernel will bugcheck immediately if it has a page fault at DISPATCH_LEVEL or higher. This means that you can't safely access paged pool, paged code segments or any user memory that hasn't been locked (i.e. by an MDL).
Above this are levels connected to hardware device interrupt levels, known as DIRQL.
The highest level is HIGH_LEVEL. Nothing can preempt this level. It's used by the kernel during a bugcheck to halt the system.
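As a small illustration of the DISPATCH_LEVEL point above, here is a sketch (WDM-style, purely illustrative) of raising the IRQL to keep the scheduler from preempting a very short critical section:

```c
/*
 * Sketch: temporarily blocking preemption by raising to DISPATCH_LEVEL.
 * Code at DISPATCH_LEVEL must not touch pageable memory or wait on
 * dispatcher objects, and should stay extremely short.
 */
#include <ntddk.h>

VOID DoShortNonPreemptibleWork(volatile LONG *SharedCounter)
{
    KIRQL oldIrql;

    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);  /* scheduler can no longer preempt us */
    (*SharedCounter)++;                     /* keep this window as small as possible */
    KeLowerIrql(oldIrql);                   /* restore the previous IRQL promptly */
}
```

In practice most drivers reach DISPATCH_LEVEL implicitly by acquiring a spin lock, which raises and lowers the IRQL for you.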
I recommend reading Scheduling, Thread Context, and IRQL
A good primer for this topic would be found at: http://www.codinghorror.com/blog/archives/001029.html
As Jeff points out for the user mode memory space:
"In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode."
So your app will have no access to kernel-mode memory; in fact, your communication with the driver is probably through IOCTLs (i.e. IRPs).
The kernel, however, has access to everything, including the mappings for your user-mode processes. This is a one-way street: user mode cannot map into kernel-mode memory, for security and stability reasons. Even though kernel-mode drivers can map user-mode memory, I would advise against it.
At least that's the way it was back before WDF. I am not sure of the capabilities of memory mapping with user mode drivers.
See also: http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/KM-UMGuide.doc
