I am new to the Linux kernel. Recently I traced the sendfile syscall in kernel 2.6.33. This is the call sequence I followed:
do_sendfile()
=> do_splice_direct()
=> splice_direct_to_actor()
=> do_splice_to()
=> do_splice_from()
=> splice_read,splice_write
Nowhere in this sequence did I find the place where splice uses a DMA copy. So where does the DMA copying take place?
Splice doesn't do any DMA copy. In fact the major usage of splice is to avoid copying at all - it tries to pass references to memory pages instead of copying the buffers.
The DMA mentioned in connection with splice happens at the "leaves" of the data path. The pages that splice passes references to are filled by, for example, a disk controller doing DMA into a buffer, and are later sent by an Ethernet controller doing DMA from those same pages as part of a packet. That is the "perfect" zero-copy scenario, which is difficult to achieve and rare.
Splice doesn't do the DMA itself; it makes it possible to avoid copying between the first DMA and the last.
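To make the point concrete from the user-space side, here is a minimal sketch of a caller of sendfile(). The function name copy_file_sendfile and the use of two regular files are my own choices for illustration (file-to-file sendfile needs Linux 2.6.33 or later). Notice that the caller only hands over file descriptors: there is no user buffer and no DMA-related parameter, because any DMA is done by the storage or network driver underneath.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the whole of `src` into `dst` without a user-space buffer.
 * Returns the number of bytes transferred, or -1 on error.
 * (Hypothetical helper name, for illustration only.) */
ssize_t copy_file_sendfile(const char *src, const char *dst)
{
    struct stat st;
    off_t offset = 0;
    ssize_t total = -1;
    int in_fd, out_fd;

    assert(src != NULL && dst != NULL);

    in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;
    out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    if (fstat(in_fd, &st) == 0) {
        total = 0;
        /* sendfile() may transfer fewer bytes than requested, so loop.
         * It advances `offset` itself. */
        while (offset < st.st_size) {
            ssize_t n = sendfile(out_fd, in_fd, &offset,
                                 (size_t)(st.st_size - offset));
            if (n < 0) {
                total = -1;
                break;
            }
            if (n == 0)
                break; /* unexpected end of input */
            total += n;
        }
    }

    close(out_fd);
    close(in_fd);
    return total;
}
```

Whether the kernel manages to avoid every copy depends on the file systems and devices involved; the call itself only gives it the opportunity.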
As I understand it, the splice_* infrastructure does its very best to minimise the amount of actual copying that is done. At best, the reader is reading from the same set of pages the writer is filling.
There are some excellent articles on LWN describing the various bits of splice() including the new system call.
What is the difference between copying from user space buffer to kernel space buffer and, mapping user space buffer to kernel space buffer and then copying kernel space buffer to another kernel data structure?
What I meant to say is:
The first method is copy_from_user() function.
The second method is, say, a user space buffer mapped into kernel space: the kernel is passed the buffer's physical address (obtained, say, via /proc/self/pagemap) and calls phys_to_virt() on it to get the corresponding kernel virtual address. The kernel then copies data from one of its own data structures, say an sk_buff, to that kernel virtual address.
Note: phys_to_virt() adds an offset of 0xc0000000 to the passed physical address to get kernel virtual address, right?
The second method describes what the KNI module in DPDK does, and the documentation says it eliminates the overhead of copying from user space to kernel space. Please explain how.
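For reference, the offset arithmetic in the note above can be modelled in user space as follows. This is only a sketch under stated assumptions: a 32-bit x86 kernel with the default 3G/1G split, where PAGE_OFFSET is 0xC0000000 and only roughly the first 896 MiB of physical memory ("lowmem") is directly mapped. Other architectures and configurations use different values. The model_* names are made up here; this is not the kernel's phys_to_virt().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for 32-bit x86 with the default 3G/1G split. */
#define MODEL_PAGE_OFFSET 0xC0000000u
#define MODEL_LOWMEM_SIZE 0x38000000u /* 896 MiB of directly mapped memory */

/* The linear mapping only covers lowmem; higher physical addresses
 * have no fixed kernel virtual address under this model. */
int model_is_lowmem(uint32_t phys)
{
    return phys < MODEL_LOWMEM_SIZE;
}

/* Model of phys_to_virt(): add the fixed offset. */
uint32_t model_phys_to_virt(uint32_t phys)
{
    assert(model_is_lowmem(phys));
    return phys + MODEL_PAGE_OFFSET;
}

/* Model of virt_to_phys(): subtract the fixed offset. */
uint32_t model_virt_to_phys(uint32_t virt)
{
    assert(virt >= MODEL_PAGE_OFFSET);
    return virt - MODEL_PAGE_OFFSET;
}
```

So under these assumptions physical 0x00001000 corresponds to kernel virtual 0xC0001000, and the translation is pure arithmetic with no page-table walk.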
It really depends on what you're trying to accomplish, but here are some differences I can think of.
To begin with, copy_from_user has some built-in security checks that should be considered.
Mapping your data "manually" into kernel space lets you read from it continuously, and perhaps monitor what the user process is doing to the data in that page, whereas with copy_from_user() you have to call it again every time you want to see changes.
Can you elaborate on what you are trying to do?
I am working on a userspace PCI driver which uses SPDK/VFIO APIs to do DMA access.
Currently, for each DMA allocation request I fill in a struct spdk_vfio_dma_map and call ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) to map the DMA region through the IOMMU, then later call ioctl(fd, VFIO_IOMMU_UNMAP_DMA, &dma_map) to remove the IOMMU mapping.
This works fine so far and looks like what the SPDK examples do. However, I am wondering whether there is a way to pre-allocate all the memory buffers in userspace and then, for each DMA allocation request, just use the pre-allocated memory instead of making an ioctl call each time.
Any ideas are appreciated.
I'm not sure I understand the issue, but the whole idea of DPDK and SPDK is to allocate all the memory you are going to use at application start or driver probe.
If you are using memory that is under application control all the time then you don't need to do VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA every DMA transaction. If this is not the case you have two options:
Do the VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every IO
Copy the payload to the memory that is already registered in VFIO_IOMMU_MAP_DMA.
The first option is better for huge memory blocks, while the second is better for small IO chunks.
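As a rough sketch of the "register once, reuse" idea, the following models a fixed-size slot pool carved out of one region. It assumes the whole region was mapped a single time at startup with VFIO_IOMMU_MAP_DMA at a known IOVA; that ioctl is deliberately not shown. All names here (dma_pool, dma_buf, POOL_MAX_SLOTS) are hypothetical and are not SPDK or VFIO APIs. Each allocation just hands back a slot's user virtual address together with the matching IOVA, so no ioctl is needed per request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_MAX_SLOTS 64 /* arbitrary cap for this sketch */

struct dma_pool {
    uint8_t *base;       /* user virtual address of the pre-mapped region */
    uint64_t iova_base;  /* IOVA the region was mapped at (assumed) */
    size_t slot_size;
    size_t nslots;
    size_t free_top;     /* number of entries on the free stack */
    size_t free_slots[POOL_MAX_SLOTS];
};

struct dma_buf {
    void *vaddr;   /* address the CPU uses */
    uint64_t iova; /* address to program into the device */
};

/* Split an already-mapped region into equal slots. Returns 0 on success. */
int dma_pool_init(struct dma_pool *p, void *base, uint64_t iova_base,
                  size_t region_size, size_t slot_size)
{
    size_t i, n;

    if (p == NULL || base == NULL || slot_size == 0)
        return -1;
    n = region_size / slot_size;
    if (n == 0)
        return -1;
    if (n > POOL_MAX_SLOTS)
        n = POOL_MAX_SLOTS;

    p->base = base;
    p->iova_base = iova_base;
    p->slot_size = slot_size;
    p->nslots = n;
    for (i = 0; i < n; i++)
        p->free_slots[i] = n - 1 - i; /* slot 0 ends up on top */
    p->free_top = n;
    return 0;
}

/* Take one slot; no system call involved. Returns -1 if the pool is empty. */
int dma_pool_alloc(struct dma_pool *p, struct dma_buf *out)
{
    size_t slot;

    if (p->free_top == 0)
        return -1;
    slot = p->free_slots[--p->free_top];
    out->vaddr = p->base + slot * p->slot_size;
    out->iova = p->iova_base + (uint64_t)slot * p->slot_size;
    return 0;
}

/* Return a slot to the pool; the IOMMU mapping stays in place. */
void dma_pool_free(struct dma_pool *p, const struct dma_buf *b)
{
    size_t slot = (size_t)((const uint8_t *)b->vaddr - p->base) / p->slot_size;

    assert(slot < p->nslots && p->free_top < p->nslots);
    p->free_slots[p->free_top++] = slot;
}
```

A payload that lives outside the pool would be copied into a slot before the transfer (the second option above); a payload produced directly in a slot needs no copy at all.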
I am working on an NDIS filter driver which copies data from NET_BUFFERs to driver-allocated buffers in the send path and pushes these driver-allocated buffers onto an internal queue. Later on, the data is copied again from the queued buffers to IRP buffers. I want to avoid this copying.
In Linux, we can create a clone of an skbuff, and the cloned skbuff can be queued for later use. Is there a similar option in Windows? If there is a way to clone the NET_BUFFER, we can avoid the first copy, from the NET_BUFFER to driver-allocated memory.
If there is a way to achieve zero copy from the NetBufferLists to IRP buffers, that would be the ideal solution. It would be really helpful if someone could suggest a better way to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason you'd need to copy the payload here is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time, then you should allocate a deep copy of all the data structures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need 1 copy. The reason is that the kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go to a low-power state. But the network card can't go to low power, because the datapath won't pause while some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by decoupling the IO to usermode from the IO over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.
I am writing a kernel module that is going to trigger an external PCIe device to read a block of data from my internal memory. To do this I need to send the PCIe device a pointer to the physical memory address of the data that I would like to send. Ultimately this data is going to be written from userspace to the kernel with the write() function (userspace) and copy_from_user() (kernel space). As I understand it, the address that my kernel module will see is still a virtual memory address. I need a way to get its physical address so that the PCIe device can find it.
1) Can I just use mmap() from userspace and place my data in a known location in DDR memory, instead of using copy_from_user()? I do not want to accidentally overwrite another process's data in memory, though.
2) My kernel module reserves PCIe data space at initialization using ioremap_nocache(), can I do the same from my kernel module or is it a bad idea to treat this memory as io memory? If I can, what would happen if the memory that I try to reserve is already in use? I do not want to hard code a static memory location and then find out that it is in use.
Thanks in advance for your help.
You don't choose a memory location and put your data there. Instead, you ask the kernel to tell you the location of your data in physical memory, and tell the board to read that location. Each page of memory (4KB) will be at a different physical location, so if you are sending more data than that, your device likely supports "scatter gather" DMA, so it can read a sequence of pages at different locations in memory.
The API is this: dma_map_page() returns a value of type dma_addr_t, which you can give to the board; call dma_unmap_page() when the transfer is finished. If you're doing scatter-gather, you'll instead put that value in the list of descriptors that you feed to the board. Again, if scatter-gather is supported, dma_map_sg() and friends will help with mapping a large buffer as a set of pages. It's still your responsibility to set up the page descriptors in the format expected by your device.
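To illustrate why a buffer larger than one page turns into a list of descriptors, here is a small user-space model of the splitting step. It only does the arithmetic on an address and a length, assuming 4 KB pages; it does not call the kernel DMA API, and in a real driver each segment's address would be the dma_addr_t obtained from dma_map_page() or dma_map_sg(). The struct and function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size */

struct segment {
    uintptr_t addr; /* start of this piece */
    size_t len;     /* bytes in this piece; never crosses a page boundary */
};

/* Split [addr, addr + len) into pieces that each stay within one page.
 * Writes at most max_segments entries and returns how many were written. */
size_t split_into_page_segments(uintptr_t addr, size_t len,
                                struct segment *out, size_t max_segments)
{
    size_t n = 0;

    assert(out != NULL || max_segments == 0);

    while (len > 0 && n < max_segments) {
        /* Bytes left before the next page boundary. */
        size_t room = MODEL_PAGE_SIZE - (size_t)(addr % MODEL_PAGE_SIZE);
        size_t chunk = len < room ? len : room;

        out[n].addr = addr;
        out[n].len = chunk;
        n++;

        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

For example, 0x300 bytes starting at 0x1F00 come out as two pieces: 0x100 bytes at 0x1F00 and 0x200 bytes at 0x2000. Each piece may sit in a different physical page, which is why the device needs one descriptor per piece.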
This is all very well written up in Linux Device Drivers (Chapter 15), which is required reading. http://lwn.net/images/pdf/LDD3/ch15.pdf. Some of the APIs have changed from when the book was written, but the concepts remain the same.
Finally, mmap(): Sure, you can allocate a kernel buffer, mmap() it out to user space and fill it there, then dma_map that buffer for transmission to the device. This is in fact probably the cleanest way to avoid copy_from_user().
I am trying to move data from a buffer in kernel space onto the hard
disk without having to incur any additional copies from kernel buffer to
user buffers or any other kernel buffers. Any ideas/suggestions would be
most helpful.
The use case is basically a demux driver which collects data into a
demux buffer in kernel space and this buffer has to be emptied
periodically by copying the contents into a FUSE-based partition on the
disk. As the buffer gets full, a user process is signalled which then
determines the sector numbers on the disk the contents need to be copied
to.
I was hoping to mmap the above demux kernel buffer into user address
space and issue a write system call to the raw partition device. But
from what I can see, this data is being cached by the kernel on its
way to the Hard Disk driver. And so I am assuming that involves
additional copies by the linux kernel.
At this point I am wondering if there is any other mechanism to do this
without involving additional copies by the kernel. I realize this is an
unusual usage scenario for non-embedded environments, but I would
appreciate any feedback on possible options.
BTW - I have tried using O_DIRECT when opening the raw partition, but
the subsequent write call fails if the buffer being passed is the
mmapped buffer.
Thanx!
You need to expose your demux buffer as a file descriptor (presumably, if you're using mmap() then you're already doing this - great!).
On the kernel side, you then need to implement the splice_read member of struct file_operations.
On the userspace side, create a pipe(), then use splice() twice - once to move the data from the demux file descriptor into the pipe, and a second time to move the data from the pipe to the disk file. Use the SPLICE_F_MOVE flag.
As documented in the splice() man page, it will avoid actual copies where it can, by copying references to pages of kernel memory rather than the pages themselves.
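The userspace side described above could look roughly like this. It is a sketch, not a tested solution for your driver: splice_fd_to_fd is a name I made up, in_fd stands for the demux file descriptor (which must support splice reads), and out_fd stands for the disk file. One caveat from the splice() man page: SPLICE_F_MOVE is only a hint, and since Linux 2.6.21 it is a no-op, although it is still accepted.

```c
#define _GNU_SOURCE /* for splice() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Move up to `len` bytes from in_fd to out_fd through a pipe,
 * without the data passing through a user-space buffer.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_fd_to_fd(int in_fd, int out_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0 && total >= 0) {
        /* Step 1: source -> pipe. Returns at most one pipe's worth. */
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break; /* end of input */

        len -= (size_t)in;
        total += in;

        /* Step 2: pipe -> destination, until the pipe is drained. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL,
                                 (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                break;
            }
            in -= out;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

The pipe acts as the in-kernel holder of page references between the two splice() calls; whether the pages are truly shared rather than copied depends on what the source's splice_read and the destination's write path support.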