Interrupt handler and virtual memory

Does an interrupt handler run the way user programs do with respect to virtual memory (TLB miss, load page descriptor), or do CPUs use a different mechanism?

The interrupt service routine (ISR) executes in kernel mode. The jump table the processor uses to decide which routine to run on an interrupt cannot itself be swapped out, because the page fault handler is reached through that same table. I don't know for certain what would happen if a handler address pointed to an unmapped region of memory. Kernel mode can use virtual memory, at least on x86. Some architecture might be able to take an access fault on an ISR address, but no OS would rely on that, because the added latency for entering the ISR would be unacceptable.
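For comparison, here is a minimal sketch (not from the original answer) of how a Linux driver registers an interrupt handler; the IRQ number, device name and cookie are made up for illustration. The point is that the handler, and the kernel dispatch structures that reach it, live in resident kernel memory, so no page fault has to be taken on the way into the ISR:

    #include <linux/interrupt.h>
    #include <linux/module.h>

    #define MY_IRQ 19                 /* hypothetical IRQ line */
    static int my_dev_cookie;         /* identifies us on a shared line */

    /* Runs in kernel mode, in interrupt context; everything it touches
     * must already be resident -- it cannot wait on a page fault. */
    static irqreturn_t my_isr(int irq, void *dev_id)
    {
            /* acknowledge the device, read status registers, etc. */
            return IRQ_HANDLED;
    }

    static int __init my_init(void)
    {
            /* The kernel records my_isr in its IRQ dispatch structures,
             * which are kept in unpageable kernel memory. */
            return request_irq(MY_IRQ, my_isr, IRQF_SHARED,
                               "my_device", &my_dev_cookie);
    }

    static void __exit my_exit(void)
    {
            free_irq(MY_IRQ, &my_dev_cookie);
    }

    module_init(my_init);
    module_exit(my_exit);
    MODULE_LICENSE("GPL");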

Related

Can we safely access a user-mode buffer passed as an argument in syscalls, or do we need to use copy_from_user? (Linux)

Let's assume I'm hooking a syscall such as openat in Linux using an LKM. Coming from the Windows driver development world, I have two questions:
Considering that the file path is passed as an argument to openat, can we safely access that buffer (read or write) directly, even though it is a user-mode address, without using copy_from_user? In Windows, certain callbacks are guaranteed to be called in the context of the calling process, so we can safely assume the buffer address belongs to the right context; and even if it doesn't, we can attach to the calling process.
In Windows, there are the concepts of paged-out memory and IRQL, which mean that in callbacks running at IRQL DISPATCH_LEVEL or above, we cannot access memory that is paged out. Is there a similar concept in Linux? For example, what happens if the user-mode address is paged out and I try to access it? Can I catch the exception? In Windows we can catch the exception for accessing an invalid user-mode address, but an invalid kernel-mode access cannot be caught and causes a BSOD.
I'm asking because I need to hook openat to monitor some file opens. The problem is that if I allocate a kernel buffer and then use copy_from_user to get the file path, this takes time; if I could somehow safely access the user-mode buffer directly, without any copy, that would save a lot of time. In Windows, even if the context is invalid, we can still attach to that process, lock the user-mode buffer with the help of an MDL, and read/write it, but I'm not sure whether this is possible in Linux.
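For what it's worth, the usual Linux answer is the opposite of what you're hoping for: copy the path with strncpy_from_user, which handles faults on paged-out or invalid user addresses for you. A minimal sketch, assuming a hypothetical hook that receives openat's user-space pathname argument:

    #include <linux/limits.h>
    #include <linux/slab.h>
    #include <linux/uaccess.h>

    /* Hypothetical hook given the user-space pathname argument of openat. */
    static int my_openat_hook(const char __user *upath)
    {
            char *kpath = kmalloc(PATH_MAX, GFP_KERNEL);
            long len;

            if (!kpath)
                    return -ENOMEM;

            /* Copies from the calling process's address space. A paged-out
             * (but valid) user page is simply faulted back in; an invalid
             * pointer yields an error instead of an oops. */
            len = strncpy_from_user(kpath, upath, PATH_MAX);
            if (len < 0) {
                    kfree(kpath);
                    return -EFAULT;
            }

            /* ... inspect kpath ... */
            kfree(kpath);
            return 0;
    }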

Which networking syscalls cause VM Exits to Hypervisor in Intel VMX?

I am having trouble understanding which (if any) system calls cause a VM Exit to VMX root-mode under Intel VMX. I am specifically interested in network-related system calls (e.g. socket, accept, send, recv) as they require a "virtual" device. I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
Any clarification would be greatly appreciated.
According to the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3, Chapter 22) none of int 0x80, sysenter and syscall, the three main instructions used under Linux to execute a system call, can cause VM exits per se. So in general there isn't a clear-cut way to tell which syscalls cause a VM exit and which ones don't.
VM exits can occur in a lot of scenarios, for example the host can configure an exception bitmap to decide which exceptions cause a VM exit, including page faults, so in theory almost any piece of code doing memory operations (kernel or user) could cause a VM exit.
Excluding such an extreme case and talking specifically about networking, as Peter Cordes suggests, what you should be concerned about are operations that [may] send and receive data, since those will eventually require communication with the hardware (NIC):
Syscalls like socket, socketpair, {get,set}sockopt, bind, shutdown (etc.) should not cause VM exits since they do not require communication with the underlying hardware and they merely manipulate logical kernel data structures.
read, recv and write can cause VM exits, unless the kernel already has data available to read or is waiting to accumulate enough data before sending (e.g. per Nagle's algorithm). Whether the kernel actually stops to read from HW or sends directly to HW depends on socket options, syscall flags and the current state of the underlying socket/connection (see the sketch after this list).
sendto, recvfrom, sendmsg, recvmsg (etc.), select, poll, epoll (etc.) on network sockets can all cause VM exits, again depending on the specific situation, pretty much the same reasoning as the previous point.
connect should not need to VM exit for datagram sockets (SOCK_DGRAM) as it merely sets a default address, but definitely can for connection-based protocols (e.g. SOCK_STREAM) as the kernel needs to send and receive packets to establish a connection.
accept also needs to send/receive data and therefore can cause VM exits.
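As a concrete illustration of the socket-option and flag dependence above (a user-space sketch, not tied to any particular hypervisor): TCP_NODELAY disables Nagle's buffering, so small writes head toward the NIC immediately, while MSG_DONTWAIT makes a read return at once when nothing is buffered:

    #include <errno.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);  /* no HW touched yet */
            int one = 1;
            char buf[64];
            ssize_t n;

            /* Disable Nagle: small sends are pushed out immediately instead
             * of being accumulated, making a HW access (and thus a possible
             * VM exit) more likely per send. */
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

            /* connect(fd, ...) would go here; for SOCK_STREAM the handshake
             * itself sends and receives packets. */

            /* On a connected socket, MSG_DONTWAIT returns buffered data as a
             * pure memory copy, or fails with EAGAIN if nothing has arrived;
             * here (unconnected) it fails with ENOTCONN. */
            n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
            if (n < 0)
                    printf("recv: %s\n", strerror(errno));
            return 0;
    }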
I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming on a multi-core processor)?
"In parallel" is not the term that I would use, but network I/O operations can be handled by the OS asynchronously, e.g. packets are not necessarily received or sent exactly when requested through a syscall, but when needed. For example, one or more VM exits needed to receive data could have already been performed before the guest userspace program issues the relative syscall.
Is it always necessary for a VM Exit to occur (when one is needed) to send a packet on the NIC, if on a multi-core system there are available cores that could allow the VMM and a guest to run concurrently? I guess what I'm asking is whether increased parallelism could prevent VM Exits simply by allowing the hypervisor to run in parallel with a guest.
When a VM exit occurs, the guest CPU is stopped and cannot resume execution until the VMM issues a VMRESUME for it (see Intel SDM Vol 3 Chapter 23.1 "Virtual Machine Control Structures Overview"). It is not possible to "prevent" a VM exit from occurring; however, on a multi-processor system the VMM could theoretically run on multiple cores and delegate the handling of a VM exit to another VMM thread while resuming the stopped VM early.
So while increased parallelism cannot prevent VM exits, it could theoretically reduce their overhead. However, do note that this can only happen for VM exits that can be handled "lazily" while resuming the guest. As an example, if the guest page-faults and VM-exits, the VMM cannot really "delegate" the handling of the VM exit and resume the guest earlier, since the guest needs the page fault to be resolved before resuming execution.
All in all, whenever the guest kernel needs to communicate with hardware, this can be a cause of VM exit. Access to emulated hardware for I/O operations requires the hypervisor to step in and therefore cause VM exits. There are however possible optimizations to consider:
Hardware passthrough can be used on systems which support IOMMU to make devices directly available to the guest OS and achieve very low overhead in HW communication with no need for VM exits. See Intel VT-d, Intel VT-c, SR-IOV, and also "PCI passthrough via OVMF" on ArchWiki.
Virtio is a standard for paravirtualization of network (NICs) and block devices (disks) which aims at reducing I/O overhead (i.e. overall number of needed VM exits), but needs support from both guest and host. The guest is "aware" of being a guest in this case. See also: Virtio for Linux/KVM.
Further reading:
x86 virtualization - Wikipedia
Virtual device passthrough for high speed VM networking - S. Garzarella, G. Lettieri, L. Rizzo
virtio: Towards a De-Facto Standard For Virtual I/O Devices - Rusty Russell

How is a page fault triggered in the Linux kernel?

I understand the Linux kernel implements demand paging: a page is not allocated until it is first accessed. This is all handled in the page fault handler. But what I don't understand is how the page fault is triggered. More precisely, what triggers the call to the page fault handler? Is it the hardware?
The page fault is raised by the CPU (more specifically, the MMU) whenever the application tries to access a virtual memory address which isn't mapped to a physical address. The page fault handler (in the kernel) then checks whether the page is currently swapped out to disk (and swaps it back in) or has been reserved but not committed (and commits it), then returns control to the application to retry the memory access instruction. If, on the other hand, the application has no valid mapping at that virtual address, the kernel delivers a SIGSEGV signal to the application.
So it's most accurate to say that the hardware triggers the call.
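You can watch those hardware-triggered faults from user space. In this sketch, mmap only sets up bookkeeping; the first write to each page traps to the kernel's fault handler, which allocates a frame and shows up in the process's minor-fault counter:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void)
    {
            struct rusage ru;
            getrusage(RUSAGE_SELF, &ru);
            return ru.ru_minflt;
    }

    int main(void)
    {
            size_t len = 64 * 4096;  /* 64 pages, assuming 4 KiB pages */
            /* No physical frames are allocated here, only page-table
             * bookkeeping for the virtual range. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            long before = minor_faults();
            memset(p, 0xAB, len);  /* first touch: MMU faults per page */
            long after = minor_faults();

            printf("minor faults while touching 64 pages: %ld\n",
                   after - before);
            munmap(p, len);
            return 0;
    }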
The other case is when the mapping leads to memory that does not exist at all (the virtual-to-physical translation fails). The MMU reports that there is no corresponding physical memory and informs the operating system; this is the page fault. The operating system determines that the page is a less recently used one that was moved out to disk, reloads it into a free page frame, and updates the page table accordingly. Control is then given back to the user application at the exact point where the page fault occurred, and the instruction is executed again; this time the MMU outputs the correct address to the memory system, and all continues.
Since the page fault is triggered by the MMU, which is part of the hardware, the hardware is indeed responsible for it.

Restriction on interrupt routines in linux kernel drivers

Every device driver book talks about not using functions that sleep in interrupt routines.
What issues occur when calling these functions from ISRs?
The issue here is a total lockup of the kernel. The kernel is in interrupt context while executing interrupt handlers; that is, the interrupt handler is not associated with any process (the current macro cannot be meaningfully used).
If you slept, you would never be able to get back to the interrupted code, since the scheduler would not know how to return to it.
Holding a lock in the interrupt handler and then sleeping would allow another process to run, re-enter the interrupt handler, and try to re-acquire the lock, deadlocking the kernel.
If you try to read more about how the scheduling in the kernel works, you will soon realize why sleeping is a no go in certain contexts.
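A sketch of the practical rule that falls out of this (the names are illustrative): in the handler, protect shared data with a spinlock, which never sleeps; leave mutexes, whose lock operation may sleep, to process context:

    #include <linux/interrupt.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(data_lock);
    static int shared_counter;

    static irqreturn_t my_isr(int irq, void *dev_id)
    {
            /* OK: spin_lock never sleeps; on contention the CPU busy-waits. */
            spin_lock(&data_lock);
            shared_counter++;
            spin_unlock(&data_lock);

            /* WRONG: mutex_lock(&some_mutex) may sleep, and in interrupt
             * context there is no process for the scheduler to put to sleep
             * and later wake up -- the kernel would lock up or oops. */
            return IRQ_HANDLED;
    }

    /* Process context must disable local interrupts around the same lock,
     * or the ISR could fire on this CPU while the lock is held and spin on
     * it forever (self-deadlock). */
    static void process_context_update(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&data_lock, flags);
            shared_counter++;
            spin_unlock_irqrestore(&data_lock, flags);
    }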

Windows processes in kernel vs system

I have a few questions related to Windows processes in kernel and usermode.
If I have a hello world application, and a hello world driver that exposes a new system call, foo(), I am curious about what I can and can't do once I am in kernel mode.
For starters, when I write my new hello world app, I am given a new process, which means I have my own user mode VM space (let's keep it simple, 32 bit Windows). So I have 2GB of space that I "own", and I can poke and peek to my heart's content. However, I am bound by my process. I can't (let's not bring shared memory into this yet) touch anyone else's memory.
If I write this hello world driver and call it from my user app, I (the driver code) am now in kernel mode.
First clarification/questions:
I am STILL in the same process as the user mode app, correct? Still have the same PID?
Memory Questions:
Memory is presented to my process as VM; even if I have 1GB of RAM, I can still address 4GB of memory (2GB user / 2GB kernel, not minding details of switches on servers or other specifics, just a general assumption here).
As a user process, I cannot peek at any kernel mode memory address, but I can do whatever I want to the user space, correct?
If I call into my hello world driver, from the driver code, do I still have the same view of the usermode memory? But now I also have access to any memory in kernel mode?
Is this kernel mode memory SHARED (unlike User mode, which is my own processes copy)? That is, writing a driver is more like writing a threaded application for a single process that is the OS (scheduling aside?)
Next question: as a driver, could I change the process that I am running in? Say I knew another app (say, a usermode webserver), and loaded the VM for that process, changed its instruction pointer, stack, or even loaded different code into the process, and then switched back to my own app? (I am not trying to do anything nefarious here; I am just curious what it really means to be in kernel mode.)
Also, once in kernel mode, can I prevent the OS from preempting me? I think (in Windows) you can set your IRQL level to do this, but I don't fully understand this, even after reading Solomon's book (Inside Windows...). I will ask another question directly related to IRQL/DPCs, but for now I would love to know if a kernel driver has the power to set an IRQL to High and take over the system.
More to come, but answers to these questions would help.
Each process has a "context" that, among other things, contains the VM mappings specific to that process (<2 GB normally in 32bit mode). When a thread executing in user mode enters kernel mode (e.g. via a system call or I/O request), the same thread is still executing, in the same process, with the same context. PsGetCurrentProcessId will return the same thing at this point as GetCurrentProcessId would have just before in user mode (likewise for thread IDs).
The user memory mappings that came with the context are still in place upon entering kernel mode: you can access user memory from kernel mode directly. There are special things that need to be done for this to be safe though: Using Neither Buffered Nor Direct I/O. In particular, an invalid address access attempt in the user space range will raise a SEH exception that needs to be caught, and the contents of user memory can change at any time due to the action of another thread in that process. Accessing an invalid address in the kernel address range causes a bugcheck. A thread executing in user mode cannot access any kernel memory.
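A hedged sketch of that pattern (e.g. for a buffer arriving via a METHOD_NEITHER IOCTL; the function name is made up): probe the pointer, then keep the access inside SEH so a fault in the user range becomes an error code instead of a bugcheck:

    #include <ntddk.h>

    /* Sketch: safely read a ULONG from a user-mode buffer in kernel mode. */
    NTSTATUS ReadUserUlong(PVOID UserBuffer, PULONG Value)
    {
            NTSTATUS status = STATUS_SUCCESS;

            __try {
                    /* Raises an exception unless the buffer lies in the user
                     * address range and is suitably aligned. */
                    ProbeForRead(UserBuffer, sizeof(ULONG),
                                 TYPE_ALIGNMENT(ULONG));

                    /* Another thread in the process can unmap or change the
                     * buffer at any moment, so the access itself stays inside
                     * __try as well. */
                    *Value = *(volatile ULONG *)UserBuffer;
            }
            __except (EXCEPTION_EXECUTE_HANDLER) {
                    status = GetExceptionCode(); /* e.g. STATUS_ACCESS_VIOLATION */
            }
            return status;
    }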
Kernel address space is not part of a process's context, so it is mapped the same in all of them. However, any number of threads may be active in kernel mode at any one time, so it is not like a single-threaded application. In general, threads service their own system calls upon entering kernel mode (as opposed to having dedicated kernel worker threads handle all requests).
The underlying structures that save thread and process state are all available in kernel mode. Mapping the VM of another process is best done ahead of time from within that process, by creating an MDL there and mapping it into system address space. If you just want to alter the context of another thread, this can be done entirely from user mode. Note that a thread must be suspended to change its context without a race condition. Loading a module into a process from kernel mode is ill-advised; all of the loader APIs are designed for use from user mode only.
Each CPU has a current IRQL that it is running at. It determines what things can interrupt what the CPU is currently doing. Only an event from a higher IRQL can preempt the CPU's current activity.
PASSIVE_LEVEL is where all user code and most kernel code executes. Many kernel APIs require the IRQL to be PASSIVE_LEVEL.
APC_LEVEL is used for kernel APCs
DISPATCH_LEVEL is for scheduler events (known as the dispatcher in NT terminology). Running at this level will prevent you from being preempted by the scheduler. Note that it is not safe to have any kind of page fault at this level; there would be a deadlock possibility with the memory manager trying to retrieve pages. The kernel will bugcheck immediately if it has a page fault at DISPATCH_LEVEL or higher. This means that you can't safely access paged pool, paged code segments or any user memory that hasn't been locked (i.e. by an MDL).
Above this are levels connected to hardware device interrupt levels, known as DIRQL.
The highest level is HIGH_LEVEL. Nothing can preempt this level. It's used by the kernel during a bugcheck to halt the system.
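Tying this back to the question above: a driver can raise its own IRQL, which holds off the dispatcher on the current CPU, but it does not "take over the system", and the paging restrictions apply for the duration. A minimal sketch:

    #include <ntddk.h>

    /* Briefly run at DISPATCH_LEVEL: the scheduler cannot preempt this CPU,
     * but device interrupts (DIRQL) and higher still can. */
    VOID DoNonPreemptibleWork(VOID)
    {
            KIRQL oldIrql;

            KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

            /* Touch only nonpaged memory here: a page fault at this level
             * bugchecks (IRQL_NOT_LESS_OR_EQUAL). Keep the work short, since
             * stalling the dispatcher stalls scheduling on this processor. */

            KeLowerIrql(oldIrql);
    }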
I recommend reading Scheduling, Thread Context, and IRQL
A good primer for this topic would be found at: http://www.codinghorror.com/blog/archives/001029.html
As Jeff points out for the user mode memory space:
"In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode."
So your app will have no access to the kernel mode memory; in fact, your communication with the driver is probably through IOCTLs (i.e. IRPs).
The kernel, however, has access to everything, including the mappings for your user mode processes. This is a one-way street: user mode cannot map into kernel mode, for security and stability reasons. Even though kernel mode drivers can map user mode memory, I would advise against it.
At least that's the way it was back before WDF. I am not sure of the capabilities of memory mapping with user mode drivers.
See also: http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/KM-UMGuide.doc
