I have a Windows console application that uses a parallel IO card (a General Standards HPDI32ALT) for high-speed data transmission.
My process runs in user mode; however, I am sure there is kernel-mode driver activity somewhere behind the device's API (PCI DMA transfers, reading device status registers, etc.). The working model is roughly this:
at startup: I request a pointer to an IO buffer from the API.
in my main loop:
block on the API waiting for room in the device's buffer (low watermark)
fill the IO buffer with transmission data
begin transmission to the device by passing it the pointer to the IO buffer (during this time the API uses DMA on the PCI bus to move the data to the card)
block on the API waiting for the IO to complete
The application appears to work correctly, with the proper data rate and sustained throughput over long periods of time. However, when I look at the process in the Sysinternals tool Process Explorer, I see a large number of page faults (~6k per second). I am moving ~30 MB/s to the card.
I have plenty of RAM and am reasonably sure the page faults are not disk IO related.
Any thoughts on what could be causing the page faults? I also have a receive side to this application that uses an identical IO card in receive mode. The receive-mode use of the API does not cause a large number of page faults.
Could the act of moving the IO buffer to kernel mode cause page faults?
So your application asks the driver for a memory buffer and you copy the send data into that buffer? That's a pretty strange model; usually you let the application manage the buffers.
If you're faulting 6K pages/s and you're only transferring 30 MB/s, you're almost getting a page fault for every page you transfer (30 MB/s at 4 KB per page is roughly 7,500 pages/s). When you get the data buffer from the driver, is it always zero-filled? I'm wondering if you're getting demand-zero faults for every transfer.
-scott
I am working on a userspace PCI driver which uses the SPDK/VFIO APIs to do DMA access.
Currently, for each DMA allocation request I need to fill in the structure spdk_vfio_dma_map and then call ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) to map the DMA region through the IOMMU, and later call ioctl(fd, VFIO_IOMMU_UNMAP_DMA, &dma_map) to unmap the IOMMU mapping.
This has been working fine so far and looks like what the SPDK examples use. However, I am wondering if there is a way to pre-allocate all the memory buffers in userspace and then, for each DMA allocation request, just use the pre-allocated memory instead of making an ioctl call each time?
Any ideas are appreciated.
I don't know if I fully get the issue, but the whole idea (of DPDK and SPDK) is to allocate all the memory you are going to use at application start or driver probe.
If you are using memory that is under application control all the time, then you don't need to do VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every DMA transaction. If this is not the case, you have two options:
Do the VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every IO
Copy the payload to the memory that is already registered in VFIO_IOMMU_MAP_DMA.
The first option is better for huge memory blocks, while the second is better for small IO chunks.
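If you do pre-map one big region at startup, the per-IO cost reduces to picking an offset inside it. Below is a minimal sketch of that one-time mapping, assuming a VFIO type-1 IOMMU container; the struct and flag names come from linux/vfio.h, while container_fd, buf and the chosen iova are placeholders for your own setup.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one pre-allocated buffer through the IOMMU once at startup.
 * Later DMA requests just hand out offsets inside [iova, iova + len)
 * instead of issuing a MAP/UNMAP ioctl per transaction. */
static int premap_dma_buffer(int container_fd, void *buf,
                             uint64_t iova, uint64_t len)
{
    struct vfio_iommu_type1_dma_map dma_map = {
        .argsz = sizeof(dma_map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)buf,
        .iova  = iova,
        .size  = len,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}

Note that VFIO pins the pages as part of the map call, so the region stays resident for as long as the mapping exists.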
I have a Linux device driver which allows a userspace process to mmap() certain regions of the device's MMIO space for writing. The device may at some point decide to revoke access to the region, and will notify the driver when this happens. The driver (asynchronously) notifies the userspace process to stop using this region.
I'd like the driver to immediately zap the PTEs for this mapping so they can be returned to device control; however, the userspace process might still be finishing a write. I'd like to simply discard these writes. The user does not need to know which writes made it to the device and which were discarded. What can the driver's fault handler do, after zapping the PTEs, to discard writes to the region harmlessly?
For the userspace process to make progress, the PTE needs to end up pointing to a writeable page.
If you don't want it writing to your device MMIO region, this implies you'll need to allocate a page of normal memory for the write to go to, just like the fault handler does for an anonymous VMA.
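A rough sketch of what that could look like, assuming the mapping was created with VM_MIXEDMAP so PFN mappings and ordinary pages can coexist in the same VMA; my_dev, revoked and mmio_pfn are made-up names, and the bookkeeping needed to eventually free the scratch pages is omitted.

#include <linux/mm.h>
#include <linux/gfp.h>

struct my_dev {                 /* hypothetical per-device state */
    bool revoked;
    unsigned long mmio_pfn;
};

/* Fault handler: serve device PFNs normally, but after revocation hand
 * back a throwaway zeroed page so stray writes land harmlessly in RAM. */
static vm_fault_t my_vma_fault(struct vm_fault *vmf)
{
    struct my_dev *dev = vmf->vma->vm_private_data;
    struct page *page;

    if (!READ_ONCE(dev->revoked))
        return vmf_insert_pfn(vmf->vma, vmf->address,
                              dev->mmio_pfn + vmf->pgoff);

    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
    if (!page)
        return VM_FAULT_OOM;

    /* Tracking and freeing these scratch pages is left out of the sketch. */
    return vmf_insert_page(vmf->vma, vmf->address, page);
}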
Alternatively, you could let your userspace task take a SIGBUS when this revocation event occurs, and just specify that a task using this device should expect this to happen and must install a SIGBUS handler that uses longjmp() to cancel its attempt to write to the device. The downside of this approach - apart from the additional complexity it dumps onto userspace - is that it makes using your device difficult from a library, as signal handlers are process-global state.
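And a userspace sketch of that SIGBUS-and-longjmp pattern, where mmio stands for whatever pointer mmap() returned for the device region (error checking trimmed):

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>

static sigjmp_buf write_env;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(write_env, 1);           /* abandon the interrupted store */
}

/* Returns 0 on success, -1 if the mapping was revoked underneath us and
 * the write was discarded. */
static int guarded_write(volatile uint32_t *mmio, uint32_t value)
{
    struct sigaction sa = { .sa_handler = sigbus_handler };

    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(write_env, 1))
        return -1;                      /* arrived here via SIGBUS */

    *mmio = value;
    return 0;
}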
I am working on an NDIS filter driver which copies data from NET_BUFFERs to driver-allocated buffers in the send path and pushes these driver-allocated buffers into an internal queue. Later on, the data is copied again from these buffers in the queue to IRP buffers. I want to avoid this copying of data.
In Linux, we can create a clone of an skbuff, and the cloned skbuff can be queued for later use. Is there a similar option available in Windows as well? If there is a way to clone the NET_BUFFER, we can simply avoid the first copy that happens from the NET_BUFFER to the driver-allocated memory buffers.
If there is a way to achieve zero copy from the NetBufferLists to the IRP buffers, that would be the ideal solution. It would be really helpful if someone could suggest a better solution to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason here that you'd need to copy the payload is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time then you should allocate a deep copy of all the datastructures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals back to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need one copy. The reason is that the kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go to a low power state. But the network card can't go to low power, because the datapath can't pause while some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by decoupling the IO to usermode from the IO over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.
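For the usermode-IRP path that does need its one copy, here is a hedged sketch of extracting a NET_BUFFER's payload into a nonpaged buffer; the helper name and pool tag are made up, and error handling is minimal.

#include <ndis.h>

/* Copy a NET_BUFFER's payload once, so the original NBL can be completed
 * back to the application while usermode examines the copy. */
static PUCHAR CopyNetBufferPayload(PNET_BUFFER Nb, PULONG OutLength)
{
    ULONG len = NET_BUFFER_DATA_LENGTH(Nb);
    PUCHAR copy = ExAllocatePoolWithTag(NonPagedPoolNx, len, 'pcFT');
    PVOID data;

    if (copy == NULL)
        return NULL;

    /* NdisGetDataBuffer either returns a pointer to the (contiguous)
     * payload, or copies it into 'copy' and returns 'copy'. */
    data = NdisGetDataBuffer(Nb, len, copy, 1, 0);
    if (data == NULL) {
        ExFreePoolWithTag(copy, 'pcFT');
        return NULL;
    }
    if (data != copy)
        RtlCopyMemory(copy, data, len);

    *OutLength = len;
    return copy;
}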
I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with a user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process, from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user process and is also aware of the virtual address where the counters lie in the user process's context. However, I am having trouble figuring out how code running in interrupt context could take up the mm context of the user process and write to it. Let me sum up what I need to do:
Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
Wait for the interrupt to occur while our user-process is the current process, then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.
Your driver should be the one that maps kernel memory for the user-space process. E.g., you may implement the .mmap callback in struct file_operations for your device.
The kernel driver may write to the kernel address it has mapped at any time (even in an interrupt handler). The user-space process will immediately see all modifications on its side of the mapping (using the address obtained with the mmap() system call).
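A sketch of that arrangement, assuming the counters fit in a single page; counter_page is a driver-owned page allocated at probe time, and the character-device plumbing around the file is omitted.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/io.h>

static void *counter_page;   /* (void *)get_zeroed_page(GFP_KERNEL) at probe */

static int counters_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys(counter_page) >> PAGE_SHIFT;

    if (vma->vm_end - vma->vm_start > PAGE_SIZE)
        return -EINVAL;

    /* Map the same physical page into the process; the interrupt handler
     * keeps incrementing counters through counter_page, and userspace
     * sees the updates through the address mmap() returned. */
    return remap_pfn_range(vma, vma->vm_start, pfn,
                           vma->vm_end - vma->vm_start, vma->vm_page_prot);
}

static const struct file_operations counters_fops = {
    .owner = THIS_MODULE,
    .mmap  = counters_mmap,
};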
Unix's architecture frowns on interrupt routines accessing user space, because a process could (in theory) be swapped out when the interrupt occurs. If the process is running on another CPU, that could be a problem, too.
I suggest that you write an ioctl to synchronize the counters, and then have the process call that ioctl every time it needs to access the counters.
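A minimal sketch of that ioctl; the command code, struct my_counters, kernel_counters and the spinlock are all made-up names, and the interrupt handler is assumed to update kernel_counters under the same lock.

#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct my_counters { u64 rx; u64 tx; };        /* made-up layout */
static struct my_counters kernel_counters;     /* updated in the ISR */
static DEFINE_SPINLOCK(counters_lock);
#define MYDEV_IOC_GET_COUNTERS _IOR('M', 0, struct my_counters)

static long counters_ioctl(struct file *filp, unsigned int cmd,
                           unsigned long arg)
{
    struct my_counters snap;
    unsigned long flags;

    if (cmd != MYDEV_IOC_GET_COUNTERS)
        return -ENOTTY;

    /* Take a consistent snapshot while excluding the interrupt handler. */
    spin_lock_irqsave(&counters_lock, flags);
    snap = kernel_counters;
    spin_unlock_irqrestore(&counters_lock, flags);

    if (copy_to_user((void __user *)arg, &snap, sizeof(snap)))
        return -EFAULT;
    return 0;
}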
Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
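Roughly, the pinning and mapping part could look like the following sketch; partial-failure cleanup is omitted, and the pinned_region type is only for illustration.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

struct pinned_region {
    struct page **pages;
    int npages;
    void *kaddr;        /* vmap()ed base; add the in-page offset to this */
};

/* Pin the user range and map it into kernel address space so that the
 * interrupt handler can write to it at any time. */
static int pin_user_region(unsigned long uaddr, size_t len,
                           struct pinned_region *r)
{
    unsigned long first = uaddr >> PAGE_SHIFT;
    unsigned long last  = (uaddr + len - 1) >> PAGE_SHIFT;
    int got;

    r->npages = last - first + 1;
    r->pages = kcalloc(r->npages, sizeof(*r->pages), GFP_KERNEL);
    if (!r->pages)
        return -ENOMEM;

    got = get_user_pages_fast(uaddr & PAGE_MASK, r->npages,
                              FOLL_WRITE, r->pages);
    if (got != r->npages)
        return -EFAULT;         /* partial-pin cleanup omitted */

    r->kaddr = vmap(r->pages, r->npages, VM_MAP, PAGE_KERNEL);
    return r->kaddr ? 0 : -ENOMEM;
}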
I don't think what you are trying to do is possible. Consider this situation:
(based on my assumptions about how your device works)
Some function allocates the user-space memory for the counters and supplies its address in PROCESS X.
A switch occurs and PROCESS Y executes.
Your device interrupts.
The address for your counters is inaccessible.
You need to schedule a kernel-mode asynchronous event (a bottom half) that will execute when PROCESS X is executing.
We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a several-channel video capture device. The device is connected to the system via an x8 PCIe Gen-1 link, which has a theoretical throughput of 16 Gbps. Our actual data rate will be around 2.8 Gbps (~350 MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor-based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256 KB from the device and it completes successfully. However, in this implementation we're only capturing the data in the kernel driver and then dropping it; we aren't streaming the data to user space at all. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function, which can be called by the user application, which uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we start a kernel thread, which continuously monitors the "data ready" bit-vector and, when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process that is using the driver. Our API catches this signal and then calls the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we can not change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested to look at Infiniband drivers as a reference, but we're completely lost in that code.
You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when it has transferred data. It's got a DMA engine, and it can handle 'descriptors', which are presumably elements of a scatter-gather list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You might get lucky, but I suspect not.
You need to write a blocking read, to which you supply a large memory buffer. The driver read op (a) gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry, writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.
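Pulling those steps together, here is a heavily abbreviated sketch of such a blocking read, using the modern dma_map_sg()/sg_table helpers in place of pci_map_sg(); acq_dev, acq_write_descriptor and the done/done_wq completion fields are hypothetical, and the unpinning/unmapping on the way out is omitted.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/wait.h>

static ssize_t acq_read(struct file *filp, char __user *ubuf,
                        size_t len, loff_t *off)
{
    struct acq_dev *dev = filp->private_data;    /* hypothetical */
    int npages = DIV_ROUND_UP(offset_in_page(ubuf) + len, PAGE_SIZE);
    struct page **pages;
    struct sg_table sgt;
    struct scatterlist *sg;
    int i, mapped;

    pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* (a) pin the user buffer so the card can DMA straight into it */
    if (get_user_pages_fast((unsigned long)ubuf & PAGE_MASK, npages,
                            FOLL_WRITE, pages) != npages)
        return -EFAULT;                          /* cleanup omitted */

    /* (b) build and map a scatter-gather list for the whole buffer */
    sg_alloc_table_from_pages(&sgt, pages, npages,
                              offset_in_page(ubuf), len, GFP_KERNEL);
    mapped = dma_map_sg(&dev->pdev->dev, sgt.sgl, sgt.nents,
                        DMA_FROM_DEVICE);

    /* (c)+(d) hand each mapped segment to the card as one descriptor */
    for_each_sg(sgt.sgl, sg, mapped, i)
        acq_write_descriptor(dev, sg_dma_address(sg), sg_dma_len(sg));

    /* block until the card's completion interrupt wakes us */
    wait_event_interruptible(dev->done_wq, dev->done);

    /* unmap, unpin and release everything here (omitted), then return */
    return len;
}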