I am having trouble understanding which (if any) system calls cause a VM exit to VMX root mode under Intel VMX. I am specifically interested in network-related system calls (e.g. socket, accept, send, recv), since they require a "virtual" device. I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
Any clarification would be greatly appreciated.
According to the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3, Chapter 22) none of int 0x80, sysenter and syscall, the three main instructions used under Linux to execute a system call, can cause VM exits per se. So in general there isn't a clear-cut way to tell which syscalls cause a VM exit and which ones don't.
VM exits can occur in a lot of scenarios, for example the host can configure an exception bitmap to decide which exceptions cause a VM exit, including page faults, so in theory almost any piece of code doing memory operations (kernel or user) could cause a VM exit.
Excluding such an extreme case and talking specifically about networking, as Peter Cordes suggests in the above comment, what you should be concerned about are operations that [may] send and receive data, since those will eventually require communication with the hardware (NIC); a short user-space sketch after the following list illustrates the distinction:
Syscalls like socket, socketpair, {get,set}sockopt, bind, shutdown (etc.) should not cause VM exits since they do not require communication with the underlying hardware and they merely manipulate logical kernel data structures.
read, recv and write can cause VM exits unless the kernel already has data available to read, or is waiting to accumulate enough data to write (e.g. as per Nagle's algorithm) before sending. Whether the kernel actually has to read from or send to the HW depends on socket options, syscall flags and the current state of the underlying socket/connection.
sendto, recvfrom, sendmsg, recvmsg (etc.), select, poll, epoll (etc.) on network sockets can all cause VM exits, again depending on the specific situation, pretty much the same reasoning as the previous point.
connect should not need to VM exit for datagram sockets (SOCK_DGRAM) as it merely sets a default address, but definitely can for connection-based protocols (e.g. SOCK_STREAM) as the kernel needs to send and receive packets to establish a connection.
accept also needs to send/receive data and therefore can cause VM exits.
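To make the distinction concrete, here is a minimal user-space sketch (mine, not part of the original answer; the destination is a documentation-range address). The first few calls only manipulate kernel data structures, while sendto()/recvfrom() may end up reaching the (emulated) NIC and are therefore the ones that can translate into VM exits:

/* Sketch: which network syscalls stay inside the kernel and which may touch the NIC. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Pure kernel bookkeeping: no hardware involved, no reason to VM-exit. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9) };
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* TEST-NET-1, example only */

    /* From here on the kernel may have to hand the packet to the NIC,
     * which on emulated hardware typically means trapping to the hypervisor. */
    sendto(fd, "ping", 4, 0, (struct sockaddr *)&dst, sizeof(dst));

    char buf[64];
    recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT, NULL, NULL);  /* EAGAIN if nothing is buffered */

    close(fd);
    return 0;
}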
I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming on a multi-core processor)?
"In parallel" is not the term that I would use, but network I/O operations can be handled by the OS asynchronously, e.g. packets are not necessarily received or sent exactly when requested through a syscall, but when needed. For example, one or more VM exits needed to receive data could have already been performed before the guest userspace program issues the relative syscall.
Is it always necessary for a VM exit to occur (when one is needed) to send a packet on the NIC, if on a multi-core system there are available cores that could allow the VMM and a guest to run concurrently? I guess what I'm asking is whether increased parallelism could prevent VM exits simply by allowing the hypervisor to run in parallel with a guest.
When a VM exit occurs, the guest CPU is stopped and cannot resume execution until the VMM issues a VMRESUME for it (see Intel SDM Vol. 3, Section 23.1, "Virtual Machine Control Structures Overview"). It is not possible to "prevent" a VM exit from occurring; however, on a multi-processor system the VMM could theoretically run on multiple cores and delegate the handling of a VM exit to another VMM thread while resuming the stopped VM early.
So while increased parallelism cannot prevent VM exits, it could theoretically reduce their overhead. However do note that this can only happen for VM exits that can be handled "lazily" while resuming the guest. As an example, if the guest page-faults and VM-exits, the VMM cannot really "delegate" the handling of the VM exit and resume the guest earlier, since the guest will need the page fault to be resolved before resuming execution.
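If it helps to see the "stopped until resumed" behaviour from the VMM's side, below is a rough sketch of a per-vCPU loop built on Linux's KVM API (not VMX root-mode code; the vCPU and kvm_run setup are assumed to have been done elsewhere). Every VM exit surfaces as a return from ioctl(KVM_RUN), and the guest vCPU does not run again until the VMM loops around and re-issues KVM_RUN (VMRESUME under the hood):

/* Per-vCPU loop of a minimal KVM-based VMM; vcpu_fd and the mmap'ed
 * struct kvm_run are assumed to have been set up elsewhere. */
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        /* The guest executes inside this ioctl; it returns on a VM exit. */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            perror("KVM_RUN");
            return;
        }
        switch (run->exit_reason) {
        case KVM_EXIT_IO:      /* guest touched an emulated I/O port */
            /* emulate the access, then fall out of the switch to resume the guest */
            break;
        case KVM_EXIT_MMIO:    /* guest touched emulated MMIO */
            break;
        case KVM_EXIT_HLT:     /* guest halted */
            return;
        default:
            fprintf(stderr, "unhandled exit reason %u\n", run->exit_reason);
            return;
        }
        /* Looping back to KVM_RUN is what actually resumes the stopped vCPU. */
    }
}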
All in all, whenever the guest kernel needs to communicate with hardware, this can be a cause of VM exit. Access to emulated hardware for I/O operations requires the hypervisor to step in and therefore cause VM exits. There are however possible optimizations to consider:
Hardware passthrough can be used on systems which support IOMMU to make devices directly available to the guest OS and achieve very low overhead in HW communication with no need for VM exits. See Intel VT-d, Intel VT-c, SR-IOV, and also "PCI passthrough via OVMF" on ArchWiki.
Virtio is a standard for paravirtualization of network (NICs) and block devices (disks) which aims at reducing I/O overhead (i.e. overall number of needed VM exits), but needs support from both guest and host. The guest is "aware" of being a guest in this case. See also: Virtio for Linux/KVM.
Further reading:
x86 virtualization - Wikipedia
Virtual device passthrough for high speed VM networking - S. Garzarella, G. Lettieri, L. Rizzo
virtio: Towards a De-Facto Standard For Virtual I/O Devices - Rusty Russell
I am trying to understand how ring 3 to ring 0 transfer works in operating systems.
I think I understand how a syscall works.
My understanding is that when a user-mode program wants to make a syscall it will set up the call arguments and issue an INT that transfers control to the OS, which will then read the args, do the work and return control back to the user program. There are also the more optimized sysenter and sysexit variants.
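For reference, here is what that "set up the arguments, then trap into ring 0" sequence looks like concretely on x86-64 Linux, using the syscall instruction directly (a sketch of my own, gcc inline asm, not portable):

/* write(1, msg, len) invoked via the raw x86-64 `syscall` instruction. */
int main(void)
{
    static const char msg[] = "hello from ring 3\n";
    long ret;

    __asm__ volatile (
        "syscall"
        : "=a" (ret)                      /* return value comes back in rax */
        : "a" (1L),                       /* rax: __NR_write                */
          "D" (1L),                       /* rdi: fd = stdout               */
          "S" (msg),                      /* rsi: buffer                    */
          "d" ((long)(sizeof(msg) - 1))   /* rdx: length                    */
        : "rcx", "r11", "memory");        /* clobbered by the syscall entry */

    return ret < 0 ? 1 : 0;
}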
All this makes sense to me if the user voluntarily makes the syscall.
However, to guarantee safety the OS cannot assume that callers will use syscalls to access resources.
My question is: what happens if a user program tries to access a resource (e.g. the disk) directly, without involving the OS?
How does the OS intercept it?
Any piece of I/O hardware, such as the disk controller, will (designer's choice) either respond to an I/O port address or a memory-space address, or possibly both. There is no other way to talk to the hardware. The hardware is sitting out on some bus. Program code must read/write some I/O port or must read/write some "memory" address which is really the device rather than actual RAM.
On x86, since the kernel controls access to both:
I/O ports, by setting or not setting the I/O port permissions, preventing ring 3 access
physical memory-space addresses (by controlling the virtual-to-physical address mapping)
then it can absolutely remove access from user mode.
So there is no instruction that user mode can execute that addresses the device. This is the fundamental aspect of the kernel/user split on any hardware: the kernel can control what user mode can do.
To pick up on a comment by @sawdust - once the kernel has set up the above restrictions, then:
an attempt to issue an I/O port instruction will trap to the kernel because access has not been granted (see the sketch after this list).
access to memory-space device addresses is simply inexpressible; there is no user-space virtual address that equates to the particular physical address required.
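As a concrete illustration of the first point, here is a small x86 Linux sketch (mine, hedged): the inb is only legal if the kernel has first granted access to that port through ioperm(2), which itself requires root / CAP_SYS_RAWIO; without that grant, executing the instruction raises #GP and the kernel kills the process with SIGSEGV.

/* Port I/O from ring 3 is only possible if the kernel sets the permission bit. */
#include <stdio.h>
#include <sys/io.h>

int main(void)
{
    if (ioperm(0x80, 1, 1) != 0) {          /* ask the kernel for port 0x80 */
        perror("ioperm");
        /* Executing inb(0x80) here anyway would fault to ring 0 (#GP)
         * and the kernel would deliver SIGSEGV to this process. */
        return 1;
    }

    unsigned char v = inb(0x80);            /* allowed: bit set in the TSS I/O bitmap */
    printf("port 0x80 = 0x%02x\n", v);
    return 0;
}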
I am writing some proof of concept code for KVM for communication between Windows 10 and the Host Linux system.
What I have is a virtual RAM device that is actually connected to a shared memory segment on the Host. The PCIe BAR 2 is a direct mapping to this RAM.
My intent is to provide a high-bandwidth, low-latency means of transferring data that doesn't go through the usual channels (sockets, etc.). Zero-copy would be ideal.
So far I have pretty much everything working. I have written a driver that calls MmAllocateMdlForIoSpace and then maps the memory to user mode with MmMapLockedPagesSpecifyCache in response to a DeviceIoControl call. This works perfectly; the user-mode application is able to address the shared memory and write to it.
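For context, the mapping path described above looks roughly like the sketch below. This is only my reconstruction under assumptions: the names ShmPhysical, ShmLength and MapBarToUser are placeholders of mine, error handling is minimal, and the unmapping/MDL cleanup on process teardown is omitted.

/* Rough sketch of the described IOCTL mapping path (placeholder names). */
#include <ntddk.h>

NTSTATUS MapBarToUser(PHYSICAL_ADDRESS ShmPhysical, SIZE_T ShmLength,
                      PVOID *UserVa, PMDL *Mdl)
{
    MM_PHYSICAL_ADDRESS_LIST range;
    range.PhysicalAddress = ShmPhysical;   /* BAR 2 of the shared-memory device */
    range.NumberOfBytes   = ShmLength;

    NTSTATUS status = MmAllocateMdlForIoSpace(&range, 1, Mdl);
    if (!NT_SUCCESS(status))
        return status;

    /* Must run in the context of the requesting process, i.e. from the
     * DeviceIoControl path; the call can raise an exception for UserMode. */
    __try {
        *UserVa = MmMapLockedPagesSpecifyCache(*Mdl, UserMode, MmWriteCombined,
                                               NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(*Mdl);
        *Mdl = NULL;
        return GetExceptionCode();
    }

    return (*UserVa != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
}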
What I am missing is the ability to use CreateFileMapping in user mode to obtain a HANDLE to a mapping of this memory. I am fairly new to Windows driver programming and as such I am uncertain as to whether this is even possible. Any pointers as to the best way to achieve this would be very helpful.
This looks strange, because once the MMU is enabled we operate with virtual addresses and don't use physical addresses directly.
I suppose that it is a kernel hardening measure.
Suppose that an attacker is able to corrupt a PTE.
If the physical location of the kernel is always known, then the attacker can immediately remap the page onto a suitable physical location and get code execution as a privileged user.
I think 'protection from DMA-capable devices' is not a valid answer.
If a malicious DMA-capable device has access to all of the physical memory, e.g. no protection through IOTLB, then the device can scrape memory and immediately find where the kernel is located in physical memory.
I am writing a kernel module in a guest operating system that will be run on a virtual machine using KVM. Here I want to allocate a memory page at a particular physical address. kmalloc() gives me memory, but at a physical address chosen by the OS.
Background: I am writing a device emulation technique in QEMU that doesn't exit when the guest communicates with the device (it does exit, for example, for memory-mapped as well as port-mapped devices). The basic idea is as follows: the guest device driver will write to a specific (guest) physical memory address, and a thread in the QEMU process will poll it continuously to check for new data (through some status bits etc.) and take action accordingly without causing an exit. Since there is no (existing) way for the guest to tell the host what address is being used by the device driver, I want a pre-specified memory page to be allocated for it.
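To illustrate the host side of that design, here is a stripped-down sketch of the polling thread. The ring_page layout and the shm pointer are purely hypothetical; obtaining a QEMU-process pointer to the agreed-upon guest page is exactly the open part of the question.

/* Host-side polling loop over a guest page shared as plain RAM (no VM exits). */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct ring_page {                 /* hypothetical layout agreed with the guest driver */
    _Atomic uint8_t status;        /* 0 = empty, 1 = guest has written new data */
    uint8_t payload[4095];
};

void poll_guest_page(struct ring_page *shm,
                     void (*consume)(const uint8_t *buf, size_t len))
{
    for (;;) {
        /* Acquire load so the payload written by the guest is visible before we
         * read it. Both sides just touch the same physical RAM, so no exit occurs. */
        if (atomic_load_explicit(&shm->status, memory_order_acquire) == 1) {
            consume(shm->payload, sizeof(shm->payload));
            atomic_store_explicit(&shm->status, 0, memory_order_release);
        }
        /* A real implementation would back off or sleep instead of spinning hot. */
    }
}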
You cannot allocate memory at a specific address; however, you can reserve certain physical addresses at boot time using reserve_bootmem(). Calling reserve_bootmem() early during boot (of course, this requires a modified kernel) will ensure that the reserved memory is not passed on to the buddy system (i.e. alloc_pages() and higher-level friends such as kmalloc()), and you will be able to use that memory for any purpose.
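A rough sketch of that approach, assuming an older kernel that still uses the bootmem allocator (newer kernels would use memblock_reserve() instead); the address below is only an example and must fall inside the guest's RAM:

/* Sketch: reserve one guest-physical page at boot, map it later from a module. */
#include <linux/bootmem.h>
#include <linux/init.h>
#include <linux/io.h>

#define MY_DEV_PHYS 0x10000000UL   /* example guest-physical address agreed with QEMU */
#define MY_DEV_SIZE PAGE_SIZE

/* Called early during boot (e.g. from setup_arch()), before the buddy
 * allocator takes over, so kmalloc()/alloc_pages() will never hand it out. */
void __init my_reserve_shared_page(void)
{
    reserve_bootmem(MY_DEV_PHYS, MY_DEV_SIZE, BOOTMEM_EXCLUSIVE);
}

/* Later, e.g. from the guest device driver, get a kernel virtual mapping
 * of the reserved page. */
void __iomem *my_map_shared_page(void)
{
    return ioremap_cache(MY_DEV_PHYS, MY_DEV_SIZE);
}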
It sounds like you should be attacking this from the other side, by having a physical memory range reserved in the memory map that the QEMU BIOS passes to the guest kernel at boot.