Linux Driver and API architecture for a data acquisition device - memory-management

We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a multi-channel video capture device. The device is connected to the system via a x8 PCIe Gen-1 link, which has a theoretical throughput of 16Gbps. Our actual data rate will be around 2.8Gbps (~350MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor-based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256KB from the device and it completes successfully. However, this implementation only captures the data in the kernel driver and then drops it; nothing is streamed to user space yet. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function, which can be called by the user application, which uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we started a kernel thread, which continuously monitors the "data ready" bit-vector, and when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process that is using the driver. Our API catches this signal and then calls back the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we cannot change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested looking at Infiniband drivers as a reference, but we're completely lost in that code.

You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when
it has transferred data. It's got a DMA engine, and it can handle
'descriptors', which are presumably elements of a scatter-gather
list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You
might get lucky, but I suspect not.
You need to write a blocking read, which you supply a large memory buffer to. The driver read op (a) gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.
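To make steps (c) and (d) a bit more concrete, here is a minimal user-space sketch of how one buffer turns into a list of per-page descriptors. The descriptor layout (struct dma_desc), the 4096-byte page size and the function name are assumptions for illustration only, not your card's real format. In an actual driver the address and length of each entry would come from sg_dma_address() and sg_dma_len() inside the for_each_sg loop rather than from the arithmetic below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u /* assumed page size */

/* Hypothetical descriptor: one bus address plus a byte count. */
struct dma_desc {
    uint64_t addr;
    uint32_t len;
};

/*
 * Split the range [start, start + len) at page boundaries and emit one
 * descriptor per piece. Returns the number of descriptors written, or 0
 * if the range is empty or does not fit in 'max' entries.
 */
size_t build_desc_list(uint64_t start, size_t len,
                       struct dma_desc *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    while (len > 0) {
        if (n == max)
            return 0; /* descriptor table too small */

        /* Bytes left before the next page boundary. */
        size_t room = PAGE_SZ - (size_t)(start % PAGE_SZ);
        size_t chunk = len < room ? len : room;

        out[n].addr = start;
        out[n].len = (uint32_t)chunk;
        n++;

        start += chunk;
        len -= chunk;
    }
    return n;
}
```

For example, an 8192-byte buffer that starts 100 bytes into a page needs three descriptors (3996, 4096 and 100 bytes), which is why the descriptor table has to be sized for one entry more than the buffer length divided by the page size.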

Related

Silently discard writes to mmap region

I have a Linux device driver which allows a userspace process to mmap() certain regions of the device's MMIO space for writing. The device may at some point decide to revoke access to the region, and will notify the driver when this happens. The driver (asynchronously) notifies the userspace process to stop using this region.
I'd like the driver to immediately zap the PTEs for this mapping so they can be returned to device control, however, the userspace process might still be finishing a write. I'd like to simply discard these writes. The user does not need to know which writes made it to the device and which writes were discarded. What can the driver's fault handler do after zapping the PTEs that can discard writes to the region harmlessly?
For the userspace process to make progress, the PTE needs to end up pointing to a writeable page.
If you don't want it writing to your device MMIO region, this implies you'll need to allocate a page of normal memory for the write to go to, just like the fault handler does for an anonymous VMA.
Alternatively, you could let your userspace task take a SIGBUS when this revocation event occurs, and just specify that a task using this device should expect this to happen and must install a SIGBUS handler that uses longjmp() to cancel its attempt to write to the device. The downside of this approach - apart from the additional complexity it dumps onto userspace - is that it makes using your device difficult from a library, as signal handlers are process-global state.

How to avoid data copy in NDIS filter driver

I am working on an NDIS filter driver which actually copies data from NET_BUFFERs to driver allocated buffers in the Send path and pushes these driver allocated buffers into an internal queue. Later on, the data is copied again from these driver allocated buffers in the queue to IRP buffers. I want to avoid this copy of data.
In Linux, we can create a clone of an skbuff and the cloned skbuff can be queued for later use. Is there a similar option available in Windows as well? If there is a way to clone the NET_BUFFER, we can simply avoid the first copy that is happening from NET_BUFFER to driver allocated memory buffers.
If there exists a way to achieve zero copy from the NetBufferLists to IRP buffers, then it would really be an ideal solution. It would be really helpful if someone can suggest a better solution to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason here that you'd need to copy the payload is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time then you should allocate a deep copy of all the data structures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals back to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need 1 copy. The reason is that the kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go to a low power state. But the network card can't go to low power, because the datapath won't pause while some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by detaching the IO to usermode from the IO over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.

Event using FTD2XX_NET.DLL

I am using an FT232RL chip with FTD2XX_NET.dll. I've made a program which writes data to and reads data from an AVR ATmega32 MCU. It first writes data, then reads the answer.
Now I want an event that tells me when there is unread data available, raised only when the AVR has sent data to the FTDI buffer, without forcing my program to loop checking for available data. For my purpose, I want the MCU to send data whenever it wants, and the PC must know when there is new data in the FTDI chip's buffer.
I know that it's impossible for the PC to know when the AVR is sending data to the FTDI. What I mean is that I need some way for my program to know whether the FTDI has new unread data in its own buffer.
I don't want to run the read operation over and over in an infinite loop as I do now.
You should create a read thread which does your reading in the background. Then from that thread you can signal an event to notify another part of your application when you have data. I'm not sure what language you are using but you should easily be able to find an example of threading and event notification with a Google search.
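Since the language wasn't stated, here is a rough sketch of that pattern in C with POSIX threads: a background thread blocks in the read, then signals a condition variable (the "event") that the rest of the program waits on. fake_device_read() is a hypothetical stand-in for the real blocking FTDI read call, and all names here are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct rx_event {
    pthread_mutex_t lock;
    pthread_cond_t  ready;   /* signalled when new data has arrived */
    unsigned char   data[64];
    size_t          len;     /* 0 means no unread data */
};

/* Hypothetical stand-in for the blocking device read. */
static size_t fake_device_read(unsigned char *buf, size_t cap)
{
    static const unsigned char msg[] = { 'A', 'V', 'R' };
    size_t n = sizeof msg < cap ? sizeof msg : cap;

    memcpy(buf, msg, n);
    return n;
}

/* Background thread: read once, publish the data, raise the event. */
static void *reader_thread(void *arg)
{
    struct rx_event *ev = arg;
    unsigned char tmp[64];
    size_t n = fake_device_read(tmp, sizeof tmp);

    pthread_mutex_lock(&ev->lock);
    memcpy(ev->data, tmp, n);
    ev->len = n;
    pthread_cond_signal(&ev->ready);
    pthread_mutex_unlock(&ev->lock);
    return NULL;
}

/* Sleep until the reader signals, then take the data. No busy loop. */
static size_t wait_for_data(struct rx_event *ev, unsigned char *out, size_t cap)
{
    size_t n;

    pthread_mutex_lock(&ev->lock);
    while (ev->len == 0)
        pthread_cond_wait(&ev->ready, &ev->lock);
    n = ev->len < cap ? ev->len : cap;
    memcpy(out, ev->data, n);
    ev->len = 0;
    pthread_mutex_unlock(&ev->lock);
    return n;
}

/* Start the reader, wait for one event, return the bytes received. */
size_t run_reader_once(unsigned char *out, size_t cap)
{
    struct rx_event ev;
    pthread_t tid;
    size_t n;

    assert(out != NULL);
    pthread_mutex_init(&ev.lock, NULL);
    pthread_cond_init(&ev.ready, NULL);
    ev.len = 0;

    if (pthread_create(&tid, NULL, reader_thread, &ev) != 0)
        return 0;
    n = wait_for_data(&ev, out, cap);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&ev.ready);
    pthread_mutex_destroy(&ev.lock);
    return n;
}
```

The waiting side sleeps inside pthread_cond_wait() and uses no CPU until the reader signals, which is the behaviour being asked for. In a real program the reader thread would loop around the read instead of returning after one message.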

bypassing tty layer and copy to user

I would like to copy data to user space from kernel module which receives data from serial port and transfers it to DMA, which in turn forwards the data to tty layer and finally to user space.
The current flow is
serial driver FIFO--> DMA-->TTY layer -->User space (the data to tty layer is emptied from DMA upon expiration of timer)
What I want to achieve is
serial driver FIFO-->DMA-->user space. (I am OK with using timer to send the data to user space, if there is a better way let me know)
Also the kernel module handling the serialFIFO->DMA is not a character device.
I would like to bypass the tty layer completely. What is the best way to achieve this?
Any pointers/code snippet would be appreciated.
In >=3.10.5 the "serial FIFO" that you refer to is called a uart_port. These are defined in drivers/tty/serial.
I assume that what you want to do is to copy the driver for your UART to a new file, then instead of using uart_insert_char to insert characters from the UART RX FIFO, you want to insert the characters into a buffer that you can access from user space.
The way to do this is to create a second driver, a misc class device driver that has file operations, including mmap, and that allocates kernel memory that the driver's mmap file operation function associates with the userspace mapped memory. There is a good example of code for this written by Maxime Ripard. This example was written for a FIQ handled device, but you can use just the probe routine's dma_zalloc_coherent call and the mmap routine, with its call to remap_pfn_range, to do the trick, that is, to associate a user space mmap on the misc device file with the alloc'ed memory.
You need to connect the memory that you allocated in your misc driver to the buffer that you write to in your UART driver using either a global void pointer, or else by using an exported symbol, if your misc driver is a module. Initialize the pointer to a known invalid value in the UART driver and test it to make sure the misc driver has assigned it before you try to insert characters to the address to which it points.
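The pointer hand-off described above can be sketched in plain C as follows. This is a user-space illustration only: shared_rx_buf, misc_driver_publish() and uart_store_char() are hypothetical names, NULL plays the role of the known invalid value, and the ring-buffer wrap is an assumption about how you might want to fill the shared memory.

```c
#include <assert.h>
#include <stddef.h>

#define RX_BUF_SIZE 16u /* assumed size of the shared buffer */

/* Owned by the UART side; NULL means "misc driver not ready yet". */
static unsigned char *shared_rx_buf = NULL;
static size_t rx_head;

/* Called by the misc driver once it has allocated the buffer. */
void misc_driver_publish(unsigned char *buf)
{
    assert(buf != NULL);
    shared_rx_buf = buf;
    rx_head = 0;
}

/*
 * Called by the UART receive path in place of uart_insert_char.
 * Returns 0 on success, -1 if the buffer has not been assigned yet.
 */
int uart_store_char(unsigned char c)
{
    if (shared_rx_buf == NULL)
        return -1; /* nothing to write into; drop the character */

    shared_rx_buf[rx_head] = c;
    rx_head = (rx_head + 1) % RX_BUF_SIZE; /* wrap around */
    return 0;
}
```

In the real drivers the write index would also have to be visible to user space (for example stored inside the mapped region) so the reader knows how far the UART side has got.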
Note that you can't add an mmap function to the UART driver directly because the UART driver class does not support an mmap file operation. It only supports the operations defined in the include/linux/serial_core.h struct uart_ops.
Admittedly this is a cumbersome solution (two device drivers), but the alternative is to write a new device class, a UART device that has an mmap operation, and that would be a lot of work compared with the above solution although it would be elegant. No one has done this to date because as Jonathan Corbet says "...not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices", though this is exactly what you are asking for.
I implemented this solution for a polling mode UART driver based on the mxs-auart.c code and Maxime's example. It was non-trivial effort but mostly because I am using a FIQ handler for the polling timer. You should allow two to three weeks to get the whole thing up and running.
The DMA aspect of your question depends on whether the UART supports DMA transfer mode. If so, then you should be able to set it using the serial flags. The i.MX28's PrimeCell auarts support DMA transfer but for my application there was no advantage over simply reading bytes directly from the UART RX FIFO.

Why would a device driver cause page faults?

I have a Windows console application that uses a parallel IO card for high speed data transmission. (General Standards HPDI32ALT)
My process is running in user mode; however, I am sure somewhere behind the device's API there is some kernel mode driver activity (PCI DMA transfers, reading device status registers, etc.). The working model is roughly this:
at startup: I request a pointer to an IO buffer from API.
in my main loop:
block on API waiting for room in device's buffer (low watermark)
fill the IO buffer with transmission data
begin transmission to device by passing it the pointer to the IO buffer (during this time the API uses DMA on PCI bus to move the data to the card)
block on API waiting for IO to complete
The application appears to be working correctly with proper data rate and sustained throughput for long periods of time; however, when I look at the process in the Sysinternals tool Process Explorer I see a large number of page faults (~6k per second). I am moving ~30MB/s to the card.
I have plenty of RAM and am reasonably sure the page faults are not disk IO related.
Any thoughts on what could be causing the page faults? I also have a receive side to this application that is using an identical IO card in receive mode. The receive mode use of the API does not cause a large number of page faults.
Could the act of moving the IO buffer to kernel mode cause page faults?
So your application asks the driver for a memory buffer and you copy the send data into that buffer? That's a pretty strange model, usually you let the application manage the buffers.
If you're faulting 6K pages/s and you're only transferring 30MB/s, you're almost getting a page fault for every page you transfer. When you get the data buffer from the driver, is it always zero filled? I'm wondering if you're getting demand zero faults for every transfer.
-scott
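The arithmetic behind that estimate, plus a way to see demand-zero faults for yourself, can be sketched as below. Assumptions: 4 KiB pages and 30 MB taken as 30 * 1024 * 1024 bytes. The second function is a POSIX (Linux-style) analogue, not the Windows API: it maps fresh anonymous memory, touches each page once, and reports how many minor faults that cost according to getrusage().

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* How many pages per second a given byte rate touches. */
unsigned long pages_per_second(unsigned long bytes_per_second,
                               unsigned long page_size)
{
    assert(page_size > 0);
    return bytes_per_second / page_size;
}

/*
 * Map 'npages' of fresh zero-filled memory, write one byte to each page,
 * and return the number of minor page faults taken while doing so
 * (-1 if the mapping could not be created).
 */
long count_first_touch_faults(size_t npages, size_t page_size)
{
    struct rusage before, after;
    size_t len = npages * page_size;
    volatile unsigned char *p;
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (mem == MAP_FAILED)
        return -1;
    p = mem;

    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < npages; i++)
        p[i * page_size] = 1; /* first touch of each page */
    getrusage(RUSAGE_SELF, &after);

    munmap(mem, len);
    return after.ru_minflt - before.ru_minflt;
}
```

With these assumptions pages_per_second(30UL * 1024 * 1024, 4096) is 7680, so roughly 6000 faults per second works out to about 0.78 faults per page transferred. That is the pattern you would expect if the buffer is handed back as fresh zero-filled memory on most transfers.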
