How is DMA cache coherency kept on Intel chipsets?

I was reading something a few months ago about Intel chipset iterations and the PCH upgrades between them, and I'm pretty sure I saw something about DMA cache coherency that involved the home agent or QHL (on Nehalem), but I can't find it now.
So I ask if anyone knows the details of any method of DMA cache coherency that has been employed by Intel and how it works.
Nehalem's global queue, as described in the optimisation manual:
Cacheline requests from the cores or from a remote package or the
I/O Hub are handled by the GQ.
The global queue checks whether the line is on the package and, if it is, snoops the appropriate cores using the core valid bits. If this is a dual-socket system and home snoop is being used, the request is sent to the QHL (the home agent on SnB), which then sends it out on the QPI link that the NUMA node bitmap refers to. If source snoop is being used, the GQ checks its own 2-bit I/O directory cache in order to generate a message for the correct QPI link; the QHL (QPI agent on SnB) must then generate another message to the correct LLC that has been assigned that address range. I'm not sure what happens in COD mode on Haswell or with SNC on the mesh architecture.

Related

Which networking syscalls cause VM Exits to Hypervisor in Intel VMX?

I am having trouble understanding which (if any) system calls cause a VM exit to VMX root mode under Intel VMX. I am specifically interested in network-related system calls (e.g. socket, accept, send, recv) as they require a "virtual" device. I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
Any clarification would be greatly appreciated.
According to the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3, Chapter 22), none of int 0x80, sysenter and syscall, the three main instructions used under Linux to execute a system call, can cause VM exits per se. So in general there isn't a clear-cut way to tell which syscalls cause a VM exit and which ones don't.
VM exits can occur in a lot of scenarios, for example the host can configure an exception bitmap to decide which exceptions cause a VM exit, including page faults, so in theory almost any piece of code doing memory operations (kernel or user) could cause a VM exit.
Excluding such an extreme case and talking specifically about networking, as Peter Cordes suggests in the above comment, what you should be concerned about are operations that [may] send and receive data, since those will eventually require communication with the hardware (NIC):
Syscalls like socket, socketpair, {get,set}sockopt, bind, shutdown (etc.) should not cause VM exits since they do not require communication with the underlying hardware and they merely manipulate logical kernel data structures.
read, recv and write can cause VM exits unless the kernel already has data available to read or is waiting to accumulate enough data to write (e.g. as per Nagle's algorithm) before sending. Whether or not the kernel actually stops to read from HW or directly sends to HW depends on socket options, syscall flags and current state of the underlying socket/connection.
sendto, recvfrom, sendmsg, recvmsg (etc.), select, poll, epoll (etc.) on network sockets can all cause VM exits, again depending on the specific situation, pretty much the same reasoning as the previous point.
connect should not need to VM exit for datagram sockets (SOCK_DGRAM) as it merely sets a default address, but definitely can for connection-based protocols (e.g. SOCK_STREAM) as the kernel needs to send and receive packets to establish a connection.
accept also needs to send/receive data and therefore can cause VM exits.
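To make the distinction above concrete, here is a small UDP client (a hedged sketch; the address and port are made up, and whether any given call exits depends entirely on the hypervisor and device model). Only the call that actually pushes a packet out ever needs to reach the virtual NIC:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Pure kernel bookkeeping: allocates a struct socket and sets an option.
     * No hardware is touched, so no VM exit should be required here. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(9000),
    };
    inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);

    /* For SOCK_DGRAM, connect() merely records the default destination:
     * still nothing on the wire. */
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));

    /* This is the call that eventually reaches the (virtual) NIC, so this is
     * where VM exits may show up, unless the NIC is paravirtualized (virtio)
     * or passed through. */
    const char msg[] = "hello";
    if (send(fd, msg, sizeof(msg), 0) < 0)
        perror("send");

    close(fd);
    return 0;
}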
I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
"In parallel" is not the term that I would use, but network I/O operations can be handled by the OS asynchronously, e.g. packets are not necessarily received or sent exactly when requested through a syscall, but when needed. For example, one or more VM exits needed to receive data could have already been performed before the guest userspace program issues the relative syscall.
Is it always necessary for a VM exit to occur in order to send a packet on the NIC, given a multi-core system with available cores that could allow the VMM and a guest to run concurrently? I guess what I'm asking is whether increased parallelism could prevent VM exits simply by allowing the hypervisor to run in parallel with a guest.
When a VM exit occurs the guest CPU is stopped and cannot resume execution until the VMM issues a VMRESUME for it (see Intel SDM Vol 3, Chapter 23.1 "Virtual Machine Control Structures Overview"). It is not possible to "prevent" a VM exit from occurring; however, on a multi-processor system the VMM could theoretically run on multiple cores and delegate the handling of a VM exit to another VMM thread while resuming the stopped VM early.
So while increased parallelism cannot prevent VM exits, it could theoretically reduce their overhead. Note, however, that this can only happen for VM exits that can be handled "lazily" while resuming the guest. For example, if the guest page-faults and VM-exits, the VMM cannot really "delegate" the handling of the VM exit and resume the guest earlier, since the guest needs the page fault to be resolved before resuming execution.
All in all, whenever the guest kernel needs to communicate with hardware, this can be a cause of VM exits. Access to emulated hardware for I/O operations requires the hypervisor to step in and therefore causes VM exits. There are however possible optimizations to consider:
Hardware passthrough can be used on systems which support IOMMU to make devices directly available to the guest OS and achieve very low overhead in HW communication with no need for VM exits. See Intel VT-d, Intel VT-c, SR-IOV, and also "PCI passthrough via OVMF" on ArchWiki.
Virtio is a standard for paravirtualization of network (NICs) and block devices (disks) which aims at reducing I/O overhead (i.e. overall number of needed VM exits), but needs support from both guest and host. The guest is "aware" of being a guest in this case. See also: Virtio for Linux/KVM.
Further reading:
x86 virtualization - Wikipedia
Virtual device passthrough for high speed VM networking - S. Garzarella, G. Lettieri, L. Rizzo
virtio: Towards a De-Facto Standard For Virtual I/O Devices - Rusty Russell

How to guarantee all DMA data has been written to RAM when getting an MSI interrupt?

There are some questions that confuse me:
1. An MSI interrupt is a memory write request. Can MSI ensure that all DMA data has been written to RAM, or does it only ensure that the data has been transferred completely across the PCI bridge?
2. If the MSI interrupt only ensures that the transfer completed at the PCI bridge, how can I guarantee that all DMA data has been written to RAM when the MSI interrupt arrives?
3. Does the MSI memory write request really write into RAM?
Thanks in advance.
Ensuring that DMA data have been written to the bus prior to the MSI write is the responsibility of the device. The device should not issue the MSI write until everything that the driver/OS needs to see with respect to the device request has been done, whether that entails memory reads, memory writes or whatever else. But, assuming things have been done in the appropriate order (on the bus) by the device (DMA write(s), then MSI write), it is then up to the host bridge to ensure that the data is written to RAM in the correct order. But typically the MSI write itself has nothing to do with any guarantees. The host bridge simply ensures that its memory transactions are executed in the order given (and the memory subsystem ensures coherence among all the CPUs and peripherals so that the data appear to have been written to memory in the correct order even if there are caches and such).
As for your question 3, the MSI write goes to wherever the device is told to send it when you setup MSI in the device registers. Typically, that "MSI memory write" is directed to an address associated with the system interrupt controller and not to actual RAM, but it's the OS/driver responsibility to configure the correct address.
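From the driver's point of view this ordering guarantee is what makes a very simple MSI handler safe. Below is a hedged Linux sketch for a hypothetical device that DMA-writes a completion record into a coherent buffer and then raises its MSI; all structure and register names are invented for illustration:

#include <linux/dma-mapping.h>
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical completion record the device DMA-writes before the MSI. */
struct my_completion {
    u32 status;                     /* written last by the device */
    u32 length;
    u8  payload[4088];
};

struct my_dev {
    struct pci_dev *pdev;
    struct my_completion *comp;     /* coherent DMA buffer */
    dma_addr_t comp_dma;
};

static irqreturn_t my_msi_handler(int irq, void *data)
{
    struct my_dev *md = data;

    /*
     * The device issued its DMA writes before the MSI write, and the host
     * bridge preserves that order, so the completion record is already
     * visible to the CPU by the time this handler runs.
     */
    if (READ_ONCE(md->comp->status) == 0)
        return IRQ_NONE;            /* spurious interrupt */

    /* ... process md->comp->payload[0 .. md->comp->length) ... */
    WRITE_ONCE(md->comp->status, 0);
    return IRQ_HANDLED;
}

static int my_setup_msi(struct my_dev *md)
{
    int ret;

    md->comp = dma_alloc_coherent(&md->pdev->dev, sizeof(*md->comp),
                                  &md->comp_dma, GFP_KERNEL);
    if (!md->comp)
        return -ENOMEM;

    /* The OS picks the MSI target address/data when allocating the vector;
     * the MSI "memory write" goes to the interrupt controller, not to RAM. */
    ret = pci_alloc_irq_vectors(md->pdev, 1, 1, PCI_IRQ_MSI);
    if (ret < 0)
        return ret;

    return request_irq(pci_irq_vector(md->pdev, 0), my_msi_handler, 0,
                       "mydev", md);
}

The handler adds no barriers or flushes for the DMA data; the guarantee comes from the device-side ordering (DMA writes, then MSI write) described above.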

Linux PCIe DMA driver

I'm currently writing a driver for a PCIe device that should send data to a Linux system using DMA. As far as I understand, my PCIe device needs a DMA controller (DMA master) and so does my Linux system (DMA slave). Currently the PCIe device has no DMA controller and should not get one. That confuses me.
A. Is the following possible?
PCIe device sends interrupt
Wait for interrupt in the Linux driver
Start DMA transfer from memory mapped PCIe registers to Linux system DMA.
Read the data from memory in userspace
I have everything set up for this; the only thing I'm missing is how to transfer the data from the PCIe registers to memory.
B. Which system call (or series of calls) do I need to use to do a DMA transfer?
C. I probably need to set up the DMA on the Linux system, but what I find points to code that assumes there is a slave, e.g. struct dma_slave_config.
The use case is collecting data from the PCIe device and making it available in memory to userspace.
Any help is much appreciated. Thanks in advance!
DMA, by definition, is completely independent of the CPU and any software (i.e. OS kernel) running on it. DMA is a way for devices to perform memory reads and writes against host memory without the involvement of the host CPU.
The way DMA usually works is something like this: software will allocate a DMA accessible region in memory and share the physical address with the device, say, by performing memory writes against the address space associated with one of the device's BARs. Then, the device will perform a DMA read or write against that block of memory. When that operation is complete, the device will issue an interrupt to the device driver so it can handle the data and/or free the memory.
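As a rough Linux-side sketch of that flow (the register layout, names and offsets below are invented, and interrupt registration is omitted; a real device defines its own programming model):

#include <linux/dma-mapping.h>
#include <linux/interrupt.h>
#include <linux/pci.h>

#define MYDEV_REG_DMA_ADDR_LO 0x00      /* hypothetical BAR0 register map */
#define MYDEV_REG_DMA_ADDR_HI 0x04
#define MYDEV_REG_DMA_LEN     0x08
#define MYDEV_REG_DMA_START   0x0c
#define MYDEV_BUF_SIZE        (64 * 1024)

struct mydev {
    void __iomem *bar0;                 /* ioremapped device registers */
    void *buf;                          /* CPU view of the DMA buffer */
    dma_addr_t buf_dma;                 /* bus address the device writes to */
};

static irqreturn_t mydev_irq(int irq, void *data)
{
    struct mydev *d = data;
    /* The device has DMA-written into d->buf; hand it to userspace here,
     * e.g. by waking a reader blocked on a char-device read() or mmap(). */
    return IRQ_HANDLED;
}

static int mydev_start_dma(struct pci_dev *pdev, struct mydev *d)
{
    /* Host side: allocate a buffer the device is allowed to write into. */
    d->buf = dma_alloc_coherent(&pdev->dev, MYDEV_BUF_SIZE,
                                &d->buf_dma, GFP_KERNEL);
    if (!d->buf)
        return -ENOMEM;

    pci_set_master(pdev);               /* let the device issue DMA */

    /* Share the bus address with the device through its BAR registers;
     * the device's own DMA engine does the actual memory writes. */
    iowrite32(lower_32_bits(d->buf_dma), d->bar0 + MYDEV_REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(d->buf_dma), d->bar0 + MYDEV_REG_DMA_ADDR_HI);
    iowrite32(MYDEV_BUF_SIZE, d->bar0 + MYDEV_REG_DMA_LEN);
    iowrite32(1, d->bar0 + MYDEV_REG_DMA_START);
    return 0;
}

Note that there is nothing to program on the Linux side beyond allocating the buffer and handing its bus address to the device; struct dma_slave_config and the dmaengine slave API are for on-chip DMA controllers and do not apply to a bus-mastering PCIe device.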
If your device does not have the capability of issuing a DMA read or write against host memory, then you'll have to interact with it using the CPU only. Discrete DMA controllers have not been a thing for a very long time.

Mapping IO space to UserMode via CreateFileMapping

I am writing some proof of concept code for KVM for communication between Windows 10 and the Host Linux system.
What I have is a virtual RAM device that is actually connected to a shared memory segment on the Host. The PCIe BAR 2 is a direct mapping to this RAM.
My intent is to provide a high bandwidth low latency means of transferring data that doesn't involve other common means used (sockets, etc). ZeroCopy would be ideal.
So far I have pretty much everything working. I have written a driver that calls MmAllocateMdlForIoSpace and then maps the memory to user mode with MmMapLockedPagesSpecifyCache via a DeviceIoControl. This works perfectly; the user mode application is able to address the shared memory and write to it.
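For reference, a condensed and hedged sketch of that mapping path inside the IOCTL handler; the DEVICE_EXTENSION fields are placeholders for wherever the driver stored the translated BAR 2 resource, the cache type is a design choice, and error handling is trimmed:

#include <ntddk.h>

typedef struct _DEVICE_EXTENSION {      /* hypothetical per-device context */
    PHYSICAL_ADDRESS Bar2Physical;      /* translated BAR 2 base address */
    SIZE_T           Bar2Length;        /* BAR 2 length in bytes */
    PMDL             Bar2Mdl;
} DEVICE_EXTENSION;

/* Called from the IOCTL dispatch routine, i.e. in the context of the
 * requesting user-mode process. */
NTSTATUS MapBar2ToUser(DEVICE_EXTENSION *DevExt, PVOID *UserVa)
{
    MM_PHYSICAL_ADDRESS_LIST range;
    PMDL mdl = NULL;
    NTSTATUS status;

    range.PhysicalAddress = DevExt->Bar2Physical;
    range.NumberOfBytes   = DevExt->Bar2Length;

    /* Build an MDL describing the I/O space behind BAR 2. */
    status = MmAllocateMdlForIoSpace(&range, 1, &mdl);
    if (!NT_SUCCESS(status))
        return status;

    /* Map it into the current (user) process. With UserMode this can raise
     * an exception on failure, hence the __try/__except. */
    __try {
        *UserVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmWriteCombined,
                                               NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        *UserVa = NULL;
    }

    if (*UserVa == NULL) {
        IoFreeMdl(mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Unmap with MmUnmapLockedPages and free with IoFreeMdl on cleanup. */
    DevExt->Bar2Mdl = mdl;
    return STATUS_SUCCESS;
}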
What I am missing is the ability to use CreateFileMapping in user mode to obtain a HANDLE to a mapping of this memory. I am fairly new to Windows driver programming and as such I am uncertain whether this is even possible. Any pointers as to the best way to achieve this would be very helpful.

Is it possible to set the DMA buffer address for a network card?

My understanding of network cards is that when receiving data, that data is DMA'd into main memory through the network card driver. The kernel then copies this memory into user space and sends any necessary messages.
My question is, in Windows, is it possible to set the address that the DMA is writing to? My goal is to eliminate the extra memory copy similar to the way NVidia's GPUDirect pipeline works.
Yes, this is possible. I believe this is called "common buffer DMA". It is used for intelligent network adapters. Taking advantage of this would require writing your own network driver. Here is some Microsoft documentation on it: http://msdn.microsoft.com/en-us/library/windows/hardware/ff565359%28v=vs.85%29.aspx
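As a rough sketch of what allocating such a common buffer looks like in an NDIS miniport (the adapter context and buffer size are placeholders; a real driver would do this from its initialization handler with full error handling):

#include <ndis.h>

#define RX_BUFFER_SIZE (64 * 1024)      /* hypothetical receive ring size */

/* Hypothetical per-adapter context. */
typedef struct _MY_ADAPTER {
    NDIS_HANDLE           MiniportAdapterHandle;
    PVOID                 RxVirtualAddress;   /* CPU address of the shared buffer */
    NDIS_PHYSICAL_ADDRESS RxPhysicalAddress;  /* address the NIC DMAs into */
} MY_ADAPTER;

/* Allocate a "common buffer": memory that is simultaneously visible to the
 * driver (virtual address) and to the NIC's DMA engine (physical address).
 * The NIC is then programmed to DMA received frames straight into it. */
NDIS_STATUS AllocateRxCommonBuffer(MY_ADAPTER *Adapter)
{
    NdisMAllocateSharedMemory(Adapter->MiniportAdapterHandle,
                              RX_BUFFER_SIZE,
                              FALSE,                       /* non-cached */
                              &Adapter->RxVirtualAddress,
                              &Adapter->RxPhysicalAddress);
    if (Adapter->RxVirtualAddress == NULL)
        return NDIS_STATUS_RESOURCES;

    /* A real driver would now hand RxPhysicalAddress to the NIC, e.g. by
     * writing it into the device's receive-descriptor registers. */
    return NDIS_STATUS_SUCCESS;
}

The receive buffer then lives at an address the driver chose, and a zero-copy design maps that same buffer to the consumer (or points the NIC at memory the consumer already owns), which is broadly the idea behind GPUDirect-style pipelines.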
