How to guarantee that all DMA data has been written to RAM when an MSI interrupt arrives? - linux-kernel

A few questions are confusing me:
1. An MSI interrupt is a memory write request. Does MSI ensure that all DMA data has been written to RAM, or only that the data has been transferred completely on the PCI bridge?
2. If an MSI interrupt only ensures that the data has been transferred completely on the PCI bridge, how can I guarantee that all DMA data has been written to RAM by the time I get the interrupt?
3. Does the MSI memory write request really write to RAM?
Thanks in advance.

Ensuring that the DMA data has been written to the bus before the MSI write is the responsibility of the device. The device should not issue the MSI write until everything that the driver/OS needs to see with respect to the device request has been done, whether that entails memory reads, memory writes or anything else. Assuming the device has done things in the appropriate order on the bus (DMA writes first, then the MSI write), it is then up to the host bridge to ensure that the data is written to RAM in the correct order. The MSI write itself typically has nothing to do with any guarantees. The host bridge simply ensures that its memory transactions are executed in the order given, and the memory subsystem ensures coherence among all the CPUs and peripherals, so the data appears to have been written to memory in the correct order even if there are caches and such.
As for your question 3, the MSI write goes to wherever the device is told to send it when you set up MSI in the device registers. Typically, that "MSI memory write" is directed to an address associated with the system interrupt controller and not to actual RAM, but it is the OS/driver's responsibility to configure the correct address.
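The ordering argument above can be pictured with a small toy model. This is only a sketch under stated assumptions, not a description of any real host bridge: the `HostBridge` class and its methods are hypothetical, and the MSI address window `0xFEE00000`-`0xFEEFFFFF` is the range conventionally used on x86 (other platforms use whatever address the OS programs into the device).

```python
from collections import deque

# Assumption: x86-style MSI target window; real systems use whatever
# address the OS/driver programs into the device's MSI registers.
MSI_WINDOW = range(0xFEE00000, 0xFEF00000)


class HostBridge:
    """Toy host bridge that retires posted writes strictly in arrival order."""

    def __init__(self):
        self.queue = deque()    # posted writes still in flight
        self.ram = {}           # address -> value, stands in for system RAM
        self.interrupts = []    # (vector, snapshot of RAM at delivery time)

    def post_write(self, addr, value):
        """Called by the device: queue one memory write transaction."""
        self.queue.append((addr, value))

    def drain(self):
        """Retire queued writes first-in, first-out."""
        while self.queue:
            addr, value = self.queue.popleft()
            if addr in MSI_WINDOW:
                # The MSI write is routed to the interrupt controller,
                # not stored in RAM. Record what RAM looks like right now.
                self.interrupts.append((value, dict(self.ram)))
            else:
                self.ram[addr] = value


# The device does its DMA writes first and issues the MSI write last.
bridge = HostBridge()
bridge.post_write(0x1000, 0xAA)
bridge.post_write(0x1004, 0xBB)
bridge.post_write(0xFEE00000, 0x41)
bridge.drain()

vector, ram_seen_by_handler = bridge.interrupts[0]
```

In this model the interrupt handler sees both DMA writes only because the device queued them before the MSI write and the bridge never reorders the queue. If the device issued the MSI write first, the snapshot would be empty.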

Related

Linux PCIe DMA driver

I'm currently writing a driver for a PCIe device that should send data to a Linux system using DMA. As far as I can understand my PCIe device needs a DMA controller (DMA master) and my Linux system too (DMA slave). Currently the PCIe device has no DMA controller and should not get one. That confuses me.
A. Is the following possible?
1. PCIe device sends interrupt
2. Wait for interrupt in the Linux driver
3. Start DMA transfer from memory mapped PCIe registers to Linux system DMA.
4. Read the data from memory in userspace
I have everything set up for this; the only thing I am missing is how to transfer the data from the PCIe registers to memory.
B. Which system call (or series of) do I need to call to do a DMA transfer?
C. I probably need to set up the DMA on the Linux system, but what I find points to code that assumes there is a slave, e.g. struct dma_slave_config.
The use case is collecting data from the PCIe device and making it available in memory to userspace.
Any help is much appreciated. Thanks in advance!
DMA, by definition, is completely independent of the CPU and any software (i.e. OS kernel) running on it. DMA is a way for devices to perform memory reads and writes against host memory without the involvement of the host CPU.
The way DMA usually works is something like this: software will allocate a DMA accessible region in memory and share the physical address with the device, say, by performing memory writes against the address space associated with one of the device's BARs. Then, the device will perform a DMA read or write against that block of memory. When that operation is complete, the device will issue an interrupt to the device driver so it can handle the data and/or free the memory.
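The flow described above can be sketched as a small simulation. Everything here is hypothetical and exists only for illustration: the `Device` and `Driver` classes, the two register offsets, and the use of a `bytearray` as host memory. A real driver would use the kernel's DMA mapping API and real BAR accesses instead.

```python
class HostMemory:
    """Stands in for DMA-accessible host RAM."""

    def __init__(self, size):
        self.data = bytearray(size)


class Device:
    """Hypothetical bus-master device with two registers behind a BAR."""

    REG_DMA_ADDR = 0x00   # where the device should write its data
    REG_DMA_LEN = 0x04    # how many bytes it may write

    def __init__(self, host_memory, raise_irq):
        self.host_memory = host_memory
        self.raise_irq = raise_irq
        self.regs = {self.REG_DMA_ADDR: 0, self.REG_DMA_LEN: 0}

    def bar_write(self, offset, value):
        """Memory write from the CPU into the device's BAR space."""
        self.regs[offset] = value

    def produce(self, payload):
        """Device-initiated DMA write, followed by an interrupt."""
        addr = self.regs[self.REG_DMA_ADDR]
        length = min(len(payload), self.regs[self.REG_DMA_LEN])
        self.host_memory.data[addr:addr + length] = payload[:length]
        self.raise_irq(length)   # only after the data has been written


class Driver:
    """Hypothetical driver: shares a buffer address, then waits for the IRQ."""

    def __init__(self, host_memory, buf_addr, buf_len):
        self.host_memory = host_memory
        self.buf_addr = buf_addr
        self.buf_len = buf_len
        self.received = []

    def attach(self, device):
        # Step 1: tell the device where the DMA buffer lives.
        device.bar_write(Device.REG_DMA_ADDR, self.buf_addr)
        device.bar_write(Device.REG_DMA_LEN, self.buf_len)

    def on_interrupt(self, length):
        # Step 3: the device says it is done, so the data can be consumed.
        start = self.buf_addr
        self.received.append(bytes(self.host_memory.data[start:start + length]))


memory = HostMemory(4096)
driver = Driver(memory, buf_addr=0x100, buf_len=64)
device = Device(memory, raise_irq=driver.on_interrupt)
driver.attach(device)
device.produce(b"sample data")   # step 2: the device does the copy itself
```

Note that the CPU-side code never copies the payload. It only publishes an address and reacts to the interrupt, which is the point of DMA.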
If your device does not have the capability of issuing a DMA read or write against host memory, then you'll have to interact with it using the CPU only. Discrete DMA controllers have not been a thing for a very long time.

How is DMA cache coherency kept on Intel chipsets?

I was reading something a few months ago about windows chipset iterations and PCH upgrades between them and I'm pretty sure I saw something on DMA cache coherency and that it involves the home agent or QHL (Nehalem) but I can't find it now.
So I ask if anyone knows the details of any method of DMA cache coherency that has been employed by Intel and how it works.
Nehalem's global queue, from the optimisation manual:
Cacheline requests from the cores or from a remote package or the I/O Hub are handled by the GQ.
The global queue checks to see if the line is on the package and, if it is, it snoops the appropriate cores using the core valid bits. If this is a dual socket system and home snoop is being used, the request will be sent to the QHL (Home agent on SnB), which will then send it to the QPI link that the NUMA node bitmap refers to. If source snoop is being used, the GQ will check its own 2-bit I/O directory cache in order to generate a message for the correct QPI link; the QHL (QPI agent on SnB) must then generate another message to the correct LLC that has been assigned that address range. I'm not sure what happens in COD mode on Haswell or SNC on the mesh architecture.

Mapping IO space to UserMode via CreateFileMapping

I am writing some proof of concept code for KVM for communication between Windows 10 and the Host Linux system.
What I have is a virtual RAM device that is actually connected to a shared memory segment on the Host. The PCIe BAR 2 is a direct mapping to this RAM.
My intent is to provide a high bandwidth low latency means of transferring data that doesn't involve other common means used (sockets, etc). ZeroCopy would be ideal.
So far I have pretty much everything working. I have written a driver that calls MmAllocateMdlForIoSpace and then maps the memory to user mode using MmMapLockedPagesSpecifyCache via a DeviceIoControl call. This works perfectly: the user mode application is able to address the shared memory and write to it.
What I am missing is the ability to use CreateFileMapping in user mode to obtain a HANDLE to a mapping of this memory. I am fairly new to Windows driver programming, so I am uncertain whether this is even possible. Any pointers as to the best way to achieve this would be very helpful.

Is it possible to set the dma buffer address for a network card?

My understanding of network cards is that when receiving data, that data is DMA'd into main memory through the network card driver. The kernel then copies this memory into user space and sends any necessary messages.
My question is: in Windows, is it possible to set the address that the DMA is writing to? My goal is to eliminate the extra memory copy, similar to the way NVIDIA's GPUDirect pipeline works.
Yes, this is possible. I believe this is called "common buffer DMA". It is used for intelligent network adapters. Taking advantage of this would require writing your own network driver. Here is some Microsoft documentation on it: http://msdn.microsoft.com/en-us/library/windows/hardware/ff565359%28v=vs.85%29.aspx

Windows processes in kernel vs system

I have a few questions related to Windows processes in kernel and usermode.
If I have a hello world application, and a hello world driver that exposes a new system call, foo(), I am curious about what I can and can't do once I am in kernel mode.
For starters, when I write my new hello world app, I am given a new process, which means I have my own user mode VM space (let's keep it simple: 32-bit Windows). So I have 2GB of space that I "own", which I can poke and peek to my heart's content. However, I am bound by my process. I can't (let's not bring shared memory into this yet) touch anyone else's memory.
If I write this hello world driver and call it from my user app, I (the driver code) am now in kernel mode.
First clarification/questions:
I am STILL in the same process as the user mode app, correct? Still have the same PID?
Memory Questions:
Memory is presented to my process as VM, that is even if I have 1GB of RAM, I can still access 4GB of memory (2GB user / 2GB of kernel - not minding details of switches on servers, or specifics, just a general assumption here).
As a user process, I cannot peek at any kernel mode memory address, but I can do whatever I want to the user space, correct?
If I call into my hello world driver, from the driver code, do I still have the same view of the usermode memory? But now I also have access to any memory in kernel mode?
Is this kernel mode memory SHARED (unlike user mode, which is my own process's copy)? That is, is writing a driver more like writing a threaded application for a single process that is the OS (scheduling aside)?
Next question. As a driver, could I change the process that I am running in? Say I knew of another app (say, a user mode webserver): could I load the VM for that process, change its instruction pointer or stack, or even load different code into the process, and then switch back to my own app? (I am not trying to do anything nefarious here, I am just curious what it really means to be in kernel mode.)
Also, once in kernel mode, can I prevent the OS from preempting me? I think (in Windows) you can set your IRQL level to do this, but I don't fully understand this, even after reading Solomon's book (Inside Windows...). I will ask another question, directly related to IRQL/DPCs, but for now I would love to know if a kernel driver has the power to set an IRQL to High and take over the system.
More to come, but answers to these questions would help.
Each process has a "context" that, among other things, contains the VM mappings specific to that process (<2 GB normally in 32-bit mode). When a thread executing in user mode enters kernel mode (e.g. from a system call or I/O request), the same thread is still executing, in the same process, with the same context. PsGetCurrentProcessId will return the same thing at this point as GetCurrentProcessId would have just before in user mode (same with thread IDs).
The user memory mappings that came with the context are still in place upon entering kernel mode: you can access user memory from kernel mode directly. There are special things that need to be done for this to be safe, though; see "Using Neither Buffered Nor Direct I/O". In particular, an invalid address access attempt in the user space range will raise an SEH exception that needs to be caught, and the contents of user memory can change at any time due to the action of another thread in that process. Accessing an invalid address in the kernel address range causes a bugcheck. A thread executing in user mode cannot access any kernel memory.
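The rules in the two paragraphs above can be summarised as a small decision function. This is a simplified sketch, assuming the default 2 GB / 2 GB split of 32-bit Windows; the function name and the outcome strings are made up for illustration, and details such as guard regions and page protection are ignored.

```python
# Assumption: default 32-bit Windows layout, user space below 0x80000000.
USER_SPACE_LIMIT = 0x80000000
ADDRESS_SPACE_SIZE = 0x100000000


def access_outcome(address, mode, mapped):
    """Return what happens when code in `mode` touches `address`.

    mode   -- "user" or "kernel" (the CPU mode of the running thread)
    mapped -- whether the address is currently a valid mapping
    """
    if not 0 <= address < ADDRESS_SPACE_SIZE:
        raise ValueError("not a 32-bit address")

    in_user_range = address < USER_SPACE_LIMIT

    if mode == "user":
        # User mode can never touch the kernel half, mapped or not.
        if not in_user_range:
            return "access violation"
        return "ok" if mapped else "access violation"

    if mode == "kernel":
        if mapped:
            return "ok"
        # Bad user-range pointer: catchable SEH exception.
        # Bad kernel-range pointer: the system bugchecks.
        return "SEH exception" if in_user_range else "bugcheck"

    raise ValueError("mode must be 'user' or 'kernel'")
```

For example, `access_outcome(0x00400000, "kernel", mapped=False)` gives `"SEH exception"`, while the same mistake at `0x80001000` gives `"bugcheck"`. That asymmetry is why driver code wraps user-buffer accesses in exception handling.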
Kernel address space is not part of a process's context, so is mapped the same between all of them. However, any number of threads may be active in kernel mode at any one time, so it is not like a single threaded application. In general, threads service their own system calls upon entering kernel mode (as opposed to having dedicated kernel worker threads to handle all requests).
The underlying structures that save thread and process state are all available in kernel mode. Mapping the VM of another process is best done ahead of time from the other process by creating an MDL from that process and mapping it into system address space. If you just want to alter the context of another thread, this can be done entirely from user mode. Note that a thread must be suspended to change its context without having a race condition. Loading a module into a process from kernel mode is ill-advised; all of the loader APIs are designed for use from user mode only.
Each CPU has a current IRQL that it is running at. It determines what things can interrupt what the CPU is currently doing. Only an event from a higher IRQL can preempt the CPU's current activity.
PASSIVE_LEVEL is where all user code and most kernel code executes. Many kernel APIs require the IRQL to be PASSIVE_LEVEL.
APC_LEVEL is used for kernel APCs
DISPATCH_LEVEL is for scheduler events (known as the dispatcher in NT terminology). Running at this level will prevent you from being preempted by the scheduler. Note that it is not safe to have any kind of page fault at this level; there would be a deadlock possibility with the memory manager trying to retrieve pages. The kernel will bugcheck immediately if it has a page fault at DISPATCH_LEVEL or higher. This means that you can't safely access paged pool, paged code segments or any user memory that hasn't been locked (i.e. by an MDL).
Above this are levels connected to hardware device interrupt levels, known as DIRQL.
The highest level is HIGH_LEVEL. Nothing can preempt this level. It's used by the kernel during a bugcheck to halt the system.
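The preemption rule and the page-fault restriction from the list above can be written down as two small predicates. This is a sketch only: the numeric values are the ones used on x86 (they differ on other architectures), the device levels between DISPATCH_LEVEL and HIGH_LEVEL are left out, and the helper names are hypothetical.

```python
from enum import IntEnum


class Irql(IntEnum):
    # x86 values; DIRQLs for hardware devices sit between
    # DISPATCH_LEVEL and HIGH_LEVEL and are omitted here.
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2
    HIGH_LEVEL = 31


def can_preempt(current, incoming):
    """Only an event at a strictly higher IRQL interrupts the CPU."""
    return incoming > current


def page_fault_allowed(current):
    """Page faults are only safe below DISPATCH_LEVEL."""
    return current < Irql.DISPATCH_LEVEL


# A thread running at DISPATCH_LEVEL is not preempted by the scheduler,
# which itself runs at DISPATCH_LEVEL ...
scheduler_can_interrupt = can_preempt(Irql.DISPATCH_LEVEL, Irql.DISPATCH_LEVEL)
# ... but it must not touch paged memory while it is there.
may_touch_paged_pool = page_fault_allowed(Irql.DISPATCH_LEVEL)
```

Both variables come out `False`. Raising the IRQL buys freedom from preemption at the cost of being restricted to non-paged memory.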
I recommend reading "Scheduling, Thread Context, and IRQL".
A good primer for this topic would be found at: http://www.codinghorror.com/blog/archives/001029.html
As Jeff points out for the user mode memory space:
"In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode."
So your app will have no access to kernel mode memory; in fact, your communication with the driver is probably through IOCTLs (i.e. IRPs).
The kernel, however, has access to everything, including the mappings for your user mode processes. This is a one-way street: user mode cannot map into kernel mode, for security and stability reasons. Even though kernel mode drivers can map into user mode memory, I would advise against it.
At least that's the way it was back before WDF. I am not sure of the capabilities of memory mapping with user mode drivers.
See also: http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/KM-UMGuide.doc
