manually allocate data buffer for skb struct - linux-kernel

I have two boards connected via the PCIe bus. They can exchange data through pre-allocated message buffers. Now I am trying to implement a virtual network interface on top of this connection.
Looking through some network driver sources shows two common ways to implement the receive path:
Use dev_alloc_skb() to allocate the skb inside the receive function and copy the data to the buffer allocated for this skb.
Use dev_alloc_skb() to allocate skbs ahead of time and put their data buffers into the RX ring.
In both cases the buffers are allocated by dev_alloc_skb(). I would like to allocate only the skb control structure and point its data pointers at my message buffer. We would also have to modify the skb free path to return the message buffer to the message pool instead of freeing it.
Can anyone point me to reference code that uses a similar approach, or propose a better approach that minimizes code change? Any suggestions are appreciated. Thanks in advance!

You could use build_skb:
So the deal would be to allocate only the data buffer for the NIC to populate its RX ring buffer, and to use build_skb() at RX completion to attach a data buffer (now filled with an Ethernet frame) to a new skb, initialize the skb_shared_info portion, and give the hot skb to the network stack.
build_skb() is the function that allocates an skb, with the caller providing the data buffer that should be attached to it. Drivers are expected to call skb_reserve() right after build_skb() to make skb->data point to the Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN).
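For illustration, here is a minimal sketch of that pattern applied to the message-buffer scheme from the question. MSG_BUF_SIZE and msg_buf_pool_put() are hypothetical stand-ins for your pool; note that on skb release, build_skb()'s buffer is freed with skb_free_frag() (or kfree() when frag_size is 0), so a custom pool still needs its own reclaim hook, as you observed:

    /* Hypothetical RX completion: wrap a pre-filled message buffer in an
     * skb without copying. The buffer must leave
     * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) bytes of tailroom
     * after the frame for build_skb() to use. */
    static void my_rx_complete(struct net_device *dev, void *buf,
                               unsigned int len)
    {
        struct sk_buff *skb;

        skb = build_skb(buf, MSG_BUF_SIZE);   /* MSG_BUF_SIZE: pool buffer size */
        if (unlikely(!skb)) {
            msg_buf_pool_put(buf);            /* hypothetical pool return */
            return;
        }

        /* Assumes the frame was written at buf + NET_SKB_PAD + NET_IP_ALIGN. */
        skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
        skb_put(skb, len);
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb);
    }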

Related

How can I improve SPDK performance on userspace DMA access?

I am working on a userspace PCI driver which uses the SPDK/VFIO APIs to do DMA access.
Currently, for each DMA allocation request, I need to fill in a struct spdk_vfio_dma_map and then call ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) to map the DMA region through the IOMMU, and later call ioctl(fd, VFIO_IOMMU_UNMAP_DMA, &dma_map) to unmap the IOMMU mapping.
This works fine so far and looks like what the SPDK examples are doing. However, I am wondering if there is a way to pre-allocate all memory buffers in userspace, so that each DMA allocation request can just use the pre-allocated memory instead of making an ioctl call each time?
Any ideas are appreciated.
Don't know if I get the issue, but the whole idea of DPDK and SPDK is to allocate all the memory you are going to use at application start or driver probe.
If the memory is under your application's control the whole time, then you don't need to do VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every DMA transaction. If that is not the case, you have two options:
Do the VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every I/O.
Copy the payload into memory that is already registered via VFIO_IOMMU_MAP_DMA.
The first option is better for huge memory blocks, while the second is better for small I/O chunks.
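For the pre-registration pattern, a rough sketch using raw VFIO follows; the pool size and IOVA are arbitrary example values, and SPDK's env layer normally performs this registration for you:

    /* Map one large anonymous buffer through the IOMMU once at startup;
     * per-IO buffers are then carved out of this pool with no further
     * ioctl calls. */
    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    #define POOL_SIZE (64UL << 20)       /* 64 MiB, tune to your workload */
    #define POOL_IOVA 0x100000000ULL     /* device-visible address, assumption */

    static void *pool_base;

    int dma_pool_init(int container_fd)
    {
        struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };

        pool_base = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool_base == MAP_FAILED)
            return -1;

        map.vaddr = (uintptr_t)pool_base;
        map.iova  = POOL_IOVA;
        map.size  = POOL_SIZE;
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

        /* One VFIO_IOMMU_MAP_DMA for the lifetime of the application. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }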

How to avoid data copy in NDIS filter driver

I am working on an NDIS filter driver which copies data from NET_BUFFERs to driver-allocated buffers in the send path and pushes those buffers onto an internal queue. Later on, the data is copied again from these queued buffers into IRP buffers. I want to avoid this copying of data.
In Linux, we can create a clone of an skbuff, and the cloned skbuff can be queued for later use. Is there a similar option available in Windows? If there is a way to clone the NET_BUFFER, we can simply avoid the first copy, from the NET_BUFFER to the driver-allocated memory buffers.
If there is a way to achieve zero copy from the NET_BUFFER_LISTs to the IRP buffers, that would be the ideal solution. It would be really helpful if someone could suggest a better way to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason you'd need to copy the payload is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time, you should allocate a deep copy of all the data structures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals back to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need one copy. The reason is that the kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go to a low-power state. But the network card can't go to low power, because its datapath can't pause while some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by decoupling the I/O to usermode from the I/O over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.
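As for the skb_clone() analogue the question asks about: NDIS does provide NdisAllocateCloneNetBufferList, which builds a new NBL that references the original's payload, so queuing a clone avoids a data copy (though the payload-lifetime caveats above still apply). A rough sketch, assuming the NBL pool handle was allocated at FilterAttach time:

    /* Clone an NBL for queuing; the clone references the original's
     * payload, so no data is copied. Free the clone later with
     * NdisFreeCloneNetBufferList. */
    PNET_BUFFER_LIST
    CloneForQueue(NDIS_HANDLE NblPoolHandle, PNET_BUFFER_LIST Original)
    {
        /* Third parameter is an optional NET_BUFFER pool; flags are 0. */
        return NdisAllocateCloneNetBufferList(Original, NblPoolHandle,
                                              NULL, 0);
    }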

Linux Driver and API architecture for a data acquisition device

We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a several-channel video capture device. The device is connected to the system via an x8 PCIe Gen 1 link, which has a theoretical throughput of 16Gbps. Our actual data rate will be around 2.8Gbps (~350MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor-based DMA mechanism and the associated driver. For example, we can start a 256KB DMA transaction from the device and it completes successfully. However, in this implementation we're only capturing the data in the kernel driver and then dropping it; we aren't streaming the data to user space at all. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function that the user application can call; it uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we start a kernel thread that continuously monitors the "data ready" bit-vector and, when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process that is using the driver. Our API catches this signal and then calls back the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we can not change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested to look at Infiniband drivers as a reference, but we're completely lost in that code.
You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when it has transferred data. It's got a DMA engine, and it can handle 'descriptors', which are presumably elements of a scatter-gather list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You might get lucky, but I suspect not.
You need to write a blocking read to which you supply a large memory buffer. The driver read op (a) gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry, writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.
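To make the shape concrete, here is a sketch of that blocking read using current API names (pin_user_pages_fast and dma_map_sgtable rather than the older get_user_pages and pci_map_sg). write_descriptor(), start_dma_and_wait() and acq_dev are hypothetical placeholders for the real hardware programming, and error handling is trimmed:

    static ssize_t acq_read(struct file *f, char __user *buf,
                            size_t len, loff_t *off)
    {
        int nr_pages = DIV_ROUND_UP(offset_in_page(buf) + len, PAGE_SIZE);
        struct page **pages;
        struct sg_table sgt;
        struct scatterlist *sg;
        int i, pinned;

        pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);

        /* (a) pin the user buffer's pages in memory */
        pinned = pin_user_pages_fast((unsigned long)buf, nr_pages,
                                     FOLL_WRITE, pages);

        /* (b) build and map a scatter-gather list for the device */
        sg_alloc_table_from_pages(&sgt, pages, pinned,
                                  offset_in_page(buf), len, GFP_KERNEL);
        dma_map_sgtable(acq_dev, &sgt, DMA_FROM_DEVICE, 0);

        /* (c)+(d) write one descriptor per mapped segment to the card */
        for_each_sgtable_dma_sg(&sgt, sg, i)
            write_descriptor(sg_dma_address(sg), sg_dma_len(sg));

        start_dma_and_wait();   /* blocks until the completion interrupt */

        dma_unmap_sgtable(acq_dev, &sgt, DMA_FROM_DEVICE, 0);
        sg_free_table(&sgt);
        unpin_user_pages(pages, pinned);
        kvfree(pages);
        return len;
    }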

Physical Memory Allocation in Kernel

I am writing a kernel module that is going to trigger an external PCIe device to read a block of data from my internal memory. To do this I need to send the PCIe device a pointer to the physical address of the data I would like to send. Ultimately this data is going to be written from userspace to the kernel with the write() function (userspace) and copy_from_user() (kernel space). As I understand it, the address that my kernel module sees is still a virtual address. I need a way to get the physical address of it so that the PCIe device can find it.
1) Can I just use mmap() from userspace and place my data at a known location in DDR memory, instead of using copy_from_user()? I do not want to accidentally overwrite another process's data in memory, though.
2) My kernel module reserves PCIe data space at initialization using ioremap_nocache(); can I do the same for this buffer, or is it a bad idea to treat this memory as I/O memory? If I can, what would happen if the memory I try to reserve is already in use? I do not want to hard-code a static memory location and then find out that it is in use.
Thanks in advance for your help.
You don't choose a memory location and put your data there. Instead, you ask the kernel to tell you the location of your data in physical memory, and tell the board to read that location. Each page of memory (4KB) will be at a different physical location, so if you are sending more data than that, your device likely supports "scatter gather" DMA, so it can read a sequence of pages at different locations in memory.
The API is this: dma_map_page() returns a value of type dma_addr_t, which you can give to the board; dma_unmap_page() is called when the transfer is finished. If you're doing scatter-gather, you'll put those values in the list of descriptors that you feed to the board instead. If scatter-gather is supported, dma_map_sg() and friends will help with mapping a large buffer into a set of pages. It's still your responsibility to set up the page descriptors in the format expected by your device.
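In its simplest single-page form, that looks roughly like the following; dev and page are whatever your driver already holds, and the register write and completion wait are hypothetical:

    /* Map one page for the device to read, hand the bus address to the
     * board, then unmap when the transfer completes. Error checking
     * (dma_mapping_error) omitted for brevity. */
    dma_addr_t bus = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_TO_DEVICE);

    write_board_address_reg(bus);    /* hypothetical register write */
    wait_for_transfer_complete();    /* hypothetical completion wait */

    dma_unmap_page(dev, bus, PAGE_SIZE, DMA_TO_DEVICE);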
This is all very well written up in Linux Device Drivers (Chapter 15), which is required reading. http://lwn.net/images/pdf/LDD3/ch15.pdf. Some of the APIs have changed from when the book was written, but the concepts remain the same.
Finally, mmap(): Sure, you can allocate a kernel buffer, mmap() it out to user space and fill it there, then dma_map that buffer for transmission to the device. This is in fact probably the cleanest way to avoid copy_from_user().
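A minimal sketch of that mmap() route, assuming a coherent allocation; BUF_SIZE, dma_dev and the board's address register are placeholders:

    /* Allocate a DMA-coherent kernel buffer, give its bus address to the
     * board, and let userspace mmap() it so copy_from_user() is skipped. */
    #define BUF_SIZE (1 << 20)           /* 1 MiB, example value */

    static void *dma_buf;
    static dma_addr_t dma_handle;
    static struct device *dma_dev;       /* set at probe time */

    static int my_probe(struct device *dev)
    {
        dma_dev = dev;
        dma_buf = dma_alloc_coherent(dev, BUF_SIZE, &dma_handle, GFP_KERNEL);
        if (!dma_buf)
            return -ENOMEM;
        /* dma_handle is the address the PCIe device should read from;
         * write it to the board's descriptor/address register here. */
        return 0;
    }

    static int my_mmap(struct file *f, struct vm_area_struct *vma)
    {
        /* Userspace fills the buffer in place; no copy_from_user(). */
        return dma_mmap_coherent(dma_dev, vma, dma_buf, dma_handle,
                                 vma->vm_end - vma->vm_start);
    }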

bypassing tty layer and copy to user

I would like to copy data to user space from a kernel module which receives data from a serial port and transfers it via DMA. Currently the data goes through the tty layer and finally to user space.
the current flow is
serial driver FIFO --> DMA --> tty layer --> user space (the DMA buffer is emptied to the tty layer upon expiration of a timer)
What I want to achieve is
serial driver FIFO --> DMA --> user space (I am OK with using a timer to send the data to user space; if there is a better way, let me know).
Also, the kernel module handling the serial FIFO -> DMA path is not a character device.
I would like to bypass the tty layer completely. What is the best way to achieve this?
Any pointers/code snippet would be appreciated.
In kernels >= 3.10.5, the "serial FIFO" that you refer to is called a uart_port. These are defined in drivers/tty/serial.
I assume that what you want to do is to copy the driver for your UART to a new file, then instead of using uart_insert_char to insert characters from the UART RX FIFO, you want to insert the characters into a buffer that you can access from user space.
The way to do this is to create a second driver: a misc class device driver that has file operations, including mmap, and that allocates kernel memory which its mmap file operation maps into userspace. There is a good example of code for this written by Maxime Ripard. That example was written for a FIQ-handled device, but you can use just the probe routine's dma_zalloc_coherent call and the mmap routine, with its call to remap_pfn_range, to do the trick, that is, to associate a userspace mmap on the misc device file with the allocated memory. (A sketch follows at the end of this answer.)
You need to connect the memory that you allocated in your misc driver to the buffer that you write to in your UART driver using either a global void pointer, or else by using an exported symbol, if your misc driver is a module. Initialize the pointer to a known invalid value in the UART driver and test it to make sure the misc driver has assigned it before you try to insert characters to the address to which it points.
Note that you can't add an mmap function to the UART driver directly because the UART driver class does not support an mmap file operation. It only supports the operations defined in the include/linux/serial_core.h struct uart_ops.
Admittedly this is a cumbersome solution (two device drivers), but the alternative is to write a new device class, a UART device that has an mmap operation, and that would be a lot of work compared with the above solution, although it would be more elegant. No one has done this to date, because as Jonathan Corbet says, "...not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices", though this is exactly what you are asking for.
I implemented this solution for a polling-mode UART driver based on the mxs-auart.c code and Maxime's example. It was a non-trivial effort, but mostly because I am using a FIQ handler for the polling timer. You should allow two to three weeks to get the whole thing up and running.
The DMA aspect of your question depends on whether the UART supports DMA transfer mode. If so, then you should be able to set it using the serial flags. The i.MX28's PrimeCell auarts support DMA transfer but for my application there was no advantage over simply reading bytes directly from the UART RX FIFO.
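For reference, a stripped-down sketch of the misc-device half described above; the buffer size and device name are arbitrary, and newer kernels spell dma_zalloc_coherent as dma_alloc_coherent (which now zeroes the buffer):

    /* Misc device that exposes a coherent DMA buffer to userspace via
     * mmap, after Maxime Ripard's example. The UART driver writes RX
     * bytes into uart_rx_buf through the exported pointer. */
    #include <linux/dma-mapping.h>
    #include <linux/miscdevice.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    #define RX_BUF_SIZE PAGE_SIZE

    void *uart_rx_buf;                   /* shared with the UART driver */
    EXPORT_SYMBOL_GPL(uart_rx_buf);
    static dma_addr_t uart_rx_dma;

    static int rxbuf_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /* Assumes the DMA handle equals the physical address, as in the
         * original example; remap_pfn_range() wants a page frame number. */
        return remap_pfn_range(vma, vma->vm_start,
                               uart_rx_dma >> PAGE_SHIFT,
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
    }

    static const struct file_operations rxbuf_fops = {
        .owner = THIS_MODULE,
        .mmap  = rxbuf_mmap,
    };

    static struct miscdevice rxbuf_misc = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "uart_rxbuf",
        .fops  = &rxbuf_fops,
    };

    static int __init rxbuf_init(void)
    {
        int ret = misc_register(&rxbuf_misc);
        if (ret)
            return ret;
        uart_rx_buf = dma_alloc_coherent(rxbuf_misc.this_device,
                                         RX_BUF_SIZE, &uart_rx_dma,
                                         GFP_KERNEL);
        return uart_rx_buf ? 0 : -ENOMEM;
    }
    module_init(rxbuf_init);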
