I'm reading https://www.kernel.org/doc/Documentation/DMA-API.txt and I don't understand why the DMA pool API is needed.
Why not allocate PAGE_SIZE chunks of DMA memory with dma_alloc_coherent and hand out offsets within them?
Also, why is dynamic DMA mapping useful for a networking device driver instead of reusing the same DMA memory?
What is the most performant option for <1KB data transfers?
Warning: I'm not an expert in the Linux kernel.
The LDD book (which may be a better read to start with) says that DMA pools work better for smaller DMA regions (smaller than a page) - https://static.lwn.net/images/pdf/LDD3/ch15.pdf page 447 or https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch15.html, "DMA pools" section:
A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool. DMA pools are also useful in situations where you may be tempted to perform DMA to small areas embedded within a larger structure. Some very obscure driver bugs have been traced down to cache coherency problems with structure fields adjacent to small DMA areas. To avoid this problem, you should always allocate areas for DMA operations explicitly, away from other, non-DMA data structures. ... Allocations are handled with dma_pool_alloc
Same is stated in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
If your driver needs lots of smaller memory regions, you can write
custom code to subdivide pages returned by dma_alloc_coherent(),
or you can use the dma_pool API to do that. A dma_pool is like
a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
Also, it understands common hardware constraints for alignment,
like queue heads needing to be aligned on N byte boundaries.
So DMA pools are an optimization for smaller allocations. You can call dma_alloc_coherent for every small DMA buffer individually (with larger overhead), or you can try to build your own pool (more custom code to manage offsets and allocations), but DMA pools are already implemented and ready to use.
The performance of the two approaches should be profiled for your case.
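For reference, a minimal sketch of the dma_pool API (the pool name, block size and alignment below are made up for illustration; dev is your struct device, e.g. from a probe path):

#include <linux/dmapool.h>

static int example_setup(struct device *dev)
{
    struct dma_pool *pool;
    dma_addr_t dma_handle;
    void *cpu_addr;

    /* Pool of small coherent buffers: 64-byte blocks, 64-byte aligned. */
    pool = dma_pool_create("my_small_bufs", dev, 64, 64, 0);
    if (!pool)
        return -ENOMEM;

    cpu_addr = dma_pool_alloc(pool, GFP_KERNEL, &dma_handle);
    if (!cpu_addr) {
        dma_pool_destroy(pool);
        return -ENOMEM;
    }

    /* ... program dma_handle into the device, touch cpu_addr from the CPU ... */

    dma_pool_free(pool, cpu_addr, dma_handle);
    dma_pool_destroy(pool);
    return 0;
}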
Example of dynamic DMA mapping in a network driver (used for skb data and fragments):
https://elixir.bootlin.com/linux/v4.6/source/drivers/net/ethernet/realtek/r8169.c
static struct sk_buff *rtl8169_alloc_rx_data
    mapping = dma_map_single(d, rtl8169_align(data), rx_buf_sz, DMA_FROM_DEVICE);

static int rtl8169_xmit_frags
    mapping = dma_map_single(d, addr, len, DMA_TO_DEVICE);

static netdev_tx_t rtl8169_start_xmit
    mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);

static void rtl8169_unmap_tx_skb
    dma_unmap_single(d, le64_to_cpu(desc->addr), len, DMA_TO_DEVICE);
Mapping skb fragments for DMA in place (if the NIC chip supports scatter/gather DMA) can be better than copying every fragment of the skb into some dedicated DMA memory; see the sketch below. Check the "Understanding Linux Network Internals" book, section "dev_queue_xmit Function" and Chapter 21, plus skb_linearize.
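For illustration, a rough sketch of that in-place mapping for a scatter/gather TX path (the function name is made up; error unwinding and descriptor setup are omitted, and mappings[] is assumed to be large enough):

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

static int example_map_skb_for_tx(struct device *d, struct sk_buff *skb,
                                  dma_addr_t *mappings)
{
    unsigned int i;
    dma_addr_t addr;

    /* Linear part of the skb. */
    addr = dma_map_single(d, skb->data, skb_headlen(skb), DMA_TO_DEVICE);
    if (dma_mapping_error(d, addr))
        return -ENOMEM;
    mappings[0] = addr;

    /* Paged fragments, mapped in place instead of being copied. */
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        addr = skb_frag_dma_map(d, frag, 0, skb_frag_size(frag),
                                DMA_TO_DEVICE);
        if (dma_mapping_error(d, addr))
            return -ENOMEM; /* real code would unmap what was already mapped */
        mappings[i + 1] = addr;
    }

    return skb_shinfo(skb)->nr_frags + 1;
}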
Example of DMA pool usage: the nvme driver (a PRP, Physical Region Page, is a 64-bit pointer that is part of the Submission Queue Entry, and a "PRP List contains a list of PRPs with generally no offsets."):
https://elixir.bootlin.com/linux/v4.6/source/drivers/nvme/host/pci.c#L1807
static int nvme_setup_prp_pools(struct nvme_dev *dev)
{
    dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                                         PAGE_SIZE, PAGE_SIZE, 0);

static bool nvme_setup_prps
    prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);

static void nvme_free_iod
    dma_pool_free(dev->prp_small_pool, list[0], prp_dma);
Related
I would first like to give a brief description of the scenario that I am working on.
What I am trying to accomplish is to load image data from my user space application and transfer it over PCIe to a custom acceleration engine located inside a FPGA board.
The specifications of my host machine are:
An Intel Xeon processor with 16 GB of RAM.
64-bit Debian Linux with kernel version 4.18.
The FPGA is a Virtex 7 KC705 development board.
The FPGA uses a PCIe controller (bridge) for the communication between the PCIe infrastructure and the AXI interface of the FPGA.
In addition, the FPGA is equipped with a DMA engine which is supposed to read data through the PCIe controller from kernel memory and forward it to the accelerator.
Since in future implementations I would like to make multiple kernel allocations up to 256M, I have configured my kernel to support CMA and DMA Contiguous Allocator.
According to dmesg I can verify that my system reserves at startup the CMA area.
Regarding the acceleration procedure:
The driver initially allocates 4M kernel memory by using the dma_alloc_coherent() with GFP_KERNEL flag. This allocation is inside the range of the CMA.
Then, from my user space application, I call mmap with the PROT_READ/PROT_WRITE and MAP_SHARED/MAP_LOCKED flags to map the previously allocated CMA memory and load the image data into it.
Once the image data is loaded I forward the dma_addr_t physical address of the CMA allocated memory and I start the DMA to transfer the data to the accelerator. When the acceleration is completed the DMA is supposed to write the processed data back to the same CMA kernel allocated memory.
On completion the user space application reads the processed data from the CMA memory and saves it to a .bmp file. When I check the "processed" image it is the same as the original one. I suppose that the processed data were never written to the CMA memory.
Is there some kind of memory protection that does not allow writing to the CMA memory when using GFP_KERNEL flag?
An interesting fact is that when I allocate kernel memory with dma_alloc_coherent but with either GFP_ATOMIC or GFP_DMA the processed data are written correctly to the kernel memory but unfortunately the allocated memory does not belong to the range of the CMA area.
What is wrong in my implementation?
Please let me know if you need more information!
In order to use mmap() I have adopted the debugfs file operations method.
Initially, I open a debugfs file as follows:
shared_image_data_file = open("/sys/kernel/debug/shared_image_data_mmap_value", O_RDWR);
The shared_image_data_mmap_value is my debugfs file which is created in my kernel driver and the shared_image_data_file is just an integer.
Then, I call mmap() from userspace as follows:
kernel_address = (unsigned int *)mmap(0, (4 * MBYTE), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, shared_image_data_file, 0);
When I call the mmap() function in user space the mmap file operation of my debugfs file executes the following function in the kernel driver:
dma_mmap_coherent(&dev->dev, vma, shared_image_data_virtual_address, shared_image_data_physical_address, length);
The shared_image_data_virtual_address is a pointer of type uint64_t while the shared_image_data_physical_address is of type dma_addr_t, and they were created earlier when I used the following code to allocate memory in kernel space:
shared_image_data_virtual_address = dma_alloc_coherent(&dev->dev, 4 * MBYTE, &shared_image_data_physical_address, GFP_KERNEL);
The address that I pass to the DMA of the FPGA is the shared_image_data_physical_address.
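To put the pieces together, the debugfs mmap file operation roughly looks like this (a simplified sketch rather than my exact code; dev and the two shared_image_data_* variables are globals in my driver):

static int shared_image_data_mmap(struct file *filp, struct vm_area_struct *vma)
{
    size_t length = vma->vm_end - vma->vm_start;

    /* Map the coherent buffer from dma_alloc_coherent() into the
     * calling process's address space. */
    return dma_mmap_coherent(&dev->dev, vma,
                             shared_image_data_virtual_address,
                             shared_image_data_physical_address,
                             length);
}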
I hope that the above are helpful.
Thank you!
I'm in the middle of writing a framebuffer driver for an SPI connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150KB. Given the way kmalloc is allocating the buffer, ksize reports that way more memory is being used - 256KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA safe. As I want the tx buffer to be about 16KB, can I allocate that buffer within the kmalloced video buffer and still be DMA safe?
This could be considered premature optimisation but there's so much spare space within the video buffer it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement of "DMA safe" is appropriate alignment - to CPU cacheline size I believe... - but shouldn't a page alignment be ok for this?
As an aside, I'm not sure if tx and rx can point to the same place. The spi.h header is unclear too (explicitly unclear actually). Given that the rx buffer will never be more than a few bytes, it would be silly to make trouble by trying to find out!
The answer appears to be yes with provisos. (Specifically that "it's more complicated than that")
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc) then you may DMA to/from that memory using the addresses returned from those routines. The underlying implication is that a page aligned buffer within kmalloc, even spanning multiple pages, will be DMA safe as the underlying physical memory is guaranteed to be contiguous and a page aligned buffer is guaranteed to be on a cache line boundary.
One proviso is whether the device is capable of driving the full bus width (e.g. old ISA devices). Thus, the physical address of the memory must be within the dma_mask of the device.
Another is cache coherency requirements. These operate at the granularity of the cache line width. To prevent two separate memory regions from sharing one cache line, the memory for DMA must begin exactly on a cache line boundary and end exactly on one. Given that this may not be known, it is recommended (see the DMA API documentation) to only map virtual regions that begin and end on page boundaries (as these are guaranteed also to be cache line boundaries, as stated above).
A DMA driver can use dma_alloc_coherent() to allocate DMA-able space in this case to guarantee that the DMA region is uncacheable. As this may be expensive, a streaming method also exists - for one way communication - where coherency is limited to cache flushes on write. Use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the spi routines will do it for me. I could use dma_map_single myself along with either unmap or dma_sync_single_for_cpu() to keep everything in sync. I won't bother at the moment though - performance tweaking after the driver works is a better strategy.
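If I did decide to manage the mapping myself, a minimal sketch would look something like this (dev is the DMA-capable device, e.g. the SPI controller's; kptr, videosize and txbufsize follow the question above):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

static int example_tx_map(struct device *dev, void *kptr,
                          size_t videosize, size_t txbufsize)
{
    u8 *txbuf = (u8 *)kptr + PAGE_ALIGN(videosize); /* page-aligned slot inside the video buffer */
    dma_addr_t txdma;

    txdma = dma_map_single(dev, txbuf, txbufsize, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, txdma))
        return -ENOMEM;

    /* ... the controller DMAs txbufsize bytes from txdma ... */

    dma_unmap_single(dev, txdma, txbufsize, DMA_TO_DEVICE);
    return 0;
}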
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space
General malloc and mmap description
malloc (or any allocation function) is supposed to allocate memory for applications. The standard glibc malloc implementation uses the sbrk() system call to allocate the memory. The memory allocated to an application is not backed by a disk file. Only when the application's pages are swapped out are the contents of memory moved to disk (the pre-configured swap device).
The other method to allocate memory is through the use of mmap. The mmap system call creates a mapping in the virtual address space of the calling process. The following is the mmap function declaration as per the POSIX standard.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
/* A few important parameters and flags are described below. */
The mmap system call can also be used to allocate memory. Typically this is used to load application binaries or shared libraries. For example, the following mmap call will allocate memory without a backing file:
address = mmap(0, length, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
Flags
MAP_ANONYMOUS: The mapping is not backed by any file; its contents are initialized to zero.
MAP_PRIVATE: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
dmalloc
dmalloc is a new API which allocates memory using a disk-backed file, i.e. calling mmap without MAP_ANONYMOUS and MAP_PRIVATE. dmalloc would be particularly useful with SSDs, which have very low read/write latency compared to HDDs. Since the file is mapped into RAM, dmalloc also benefits from high-speed RAM.
Alternatives
An SSD can also be configured as the highest-priority swap device; however, this approach suffers from the HDD-optimized swapping algorithm inside the Linux kernel. The swapping algorithm tries to cluster application pages on swap. When data from swap is needed, it reads the complete cluster (read-ahead). If an application is doing random I/O, the read-ahead data causes unnecessary I/O to disk.
Questions:
1) What is meant by "allocates memory using a disk backed file i.e. without MAP_ANONYMOUS and MAP_PRIVATE to mmap"? Which flags should I use apart from those two?
2) How do I create an on-write backup of the memory allocated to an application?
I have never heard of dmalloc, but as you describe it, it looks like a mix between malloc (pure memory allocation) and mmap (direct mapping of memory to disk). dmalloc seems to allocate memory backed by disk, but in a way that is more performant than mmap on the backing disk (e.g. an SSD). I could imagine that it groups write operations before actually flushing the writes to disk, whereas mmap is more or less a "virtual memory window" onto a disk file.
As for your questions:
1) MAP_ANONYMOUS and MAP_PRIVATE are flags for use with mmap. The fact that these flags are mentioned as not being used makes me think dmalloc is a fresh implementation that has no relationship to mmap. (For what a disk-backed mapping without those two flags looks like, see the sketch after this answer.)
2) dmalloc seems to be suited for what you describe: it "backs up" memory to disk, similar to mmap. You may need to read the details of the documentation to know exactly when you have a guarantee that the data is effectively on disk (caching, ...).
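For reference, a minimal sketch of a disk-backed mapping, i.e. mmap without MAP_ANONYMOUS and MAP_PRIVATE (the file name and size are made up):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t length = 1 << 20;                 /* 1 MiB, for example */

    /* Open (or create) the backing file and size it. */
    int fd = open("backing_store.bin", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, length);

    /* File-backed, shared mapping: writes eventually reach the file;
     * msync(p, length, MS_SYNC) forces them to disk. */
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ... use p as ordinary memory ... */

    munmap(p, length);
    close(fd);
    return 0;
}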
While reading https://stackoverflow.com/a/3190489/196561 I came up with a question. The Qt authors say in "Inside the Qt 4 Containers":
... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.
My questions are:
1) Is it true that modern OSes (Linux, the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping and without a byte-by-byte copy?
2) What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but it was flawed and is a no-op now(?))
3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs in all the tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
Update for question 3: some tests of mremap vs memcpy
Is it true that modern OSes (Linux, the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping and without a byte-by-byte copy?
What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but it was flawed and is a no-op now(?))
See thejh's answer.
Who are the actors?
You have at least three actors with your Qt example.
Qt Vector class
glibc's realloc()
Linux's mremap
QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations in bins, so a typical random-sized allocation will still have some room to grow without needing to call into the system. I.e., the free pool or slack is located at the end of the allocated memory.
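As a quick illustration of that slack with glibc (the exact value reported depends on the allocator and its version):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(100);

    /* glibc usually rounds the request up, leaving room to grow in place. */
    printf("requested 100 bytes, usable size is %zu\n", malloc_usable_size(p));
    free(p);
    return 0;
}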
An answer
Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs in all the tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
First off, "faster ... than mremap" is misusing mremap(), as R. notes there.
There are several things that make mremap() valuable as a primitive for realloc().
Reduced memory consumption.
Preserving page mappings.
Avoiding moving data.
Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OS's.
Reduce memory consumption
Consider a naive realloc().
void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr); /* from bin, address, map table, etc. */

    if (size <= old_size) {
        resize(ptr);
        return ptr;
    }

    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size); /* fully committed: old_size + new size */
        free(ptr);
    }
    return new_p;
}
In order to support this, you may need double the memory of the realloc()'d region committed at once before you go to swap or simply fail to reallocate.
Preserve page mappings
Linux will by default map new allocations to the zero page: a 4k page full of zero data. This is useful for sparsely mapped data structures. If no one writes to the data page, then no physical memory is allocated besides a possible PTE table. These mappings are copy-on-write (COW). With the naive realloc(), these mappings will not be preserved, and full physical memory is allocated for all the zero pages.
If the task is involved in a fork(), the initial realloc() data may be shared between parent and child. Again, COW will cause physical allocation of pages. The naive implementation will discount this and require separate physical memory per process.
If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads of the swap page into memory, copy to the updated location, and then likely write the data out to the disk.
Avoid moving data
The issue you raise about updating TLBs is minimal compared to the data. A single TLB entry is typically 4 bytes and represents a page (4K) of physical data. If you flush the entire TLB for a 4GB system, that is 4MB of data that needs to be restored. Copying large amounts of data will blow the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache; it is rare that code will get two TLB misses in a row, as most code is sequential.
CPU caches come in two variants, VIVT (common on non-x86) and VIPT (as on x86). The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated, as they are physically tagged.
On a multi-core system, it is atypical to run one process on all cores. Only cores where the process performing the mremap() is running need page table updates. As a process is migrated to a core (typical context switch), it needs to have the page table migrated anyway.
Conclusion
You can construct some pathological cases where a naive copy will work better. As Linux (and most OSes) are multi-tasking, more than one process will be running. Also, the worst case will be when swapping, and the naive implementation will always be worse there (unless you have a disk faster than memory). For minimal realloc() sizes, dlmalloc or QVector should have fallow space to avoid a system-level mremap(). A typical mremap() may just expand a virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush, with all of the following being true:
The realloc() memory should not be shared with a parent or child process.
The memory should not be sparse (mostly zero or untouched).
The system should not be under memory pressure using swap.
The TLB flush and IPI need to take place only if the same process is current on other cores. L1-cache loading is not needed for the mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version will force the L2 to reload. The mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to do this, such as pre-fetching data.
For Linux, see man 2 mremap:
void *mremap(void *old_address, size_t old_size,
             size_t new_size, int flags, ... /* void *new_address */);

mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same time (controlled by the flags argument and the available virtual address space).
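A small user-space sketch of growing an anonymous mapping with mremap() (the sizes are arbitrary); the kernel may satisfy this by remapping pages rather than copying the data byte by byte:

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
    size_t old_size = 16 * 4096, new_size = 64 * 4096;

    void *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* ... fill p ... */

    /* Grow the mapping; MREMAP_MAYMOVE allows the kernel to move it
     * to a new virtual address if it cannot be extended in place. */
    void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);

    munmap(q, new_size);
    return 0;
}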
Is the DMA address returned by dma_alloc_coherent the same as the physical address? LDD3 says the DMA address should be treated as opaque by the driver. I want to mmap this DMA buffer so user space can read/write directly to it. The question is what PFN I should specify for remap_pfn_range (which, to my pleasant surprise, now (kernel 3.4+) works for conventional memory the same as for I/O memory). Can I just cast the DMA address to unsigned long and convert that to a PFN? Isn't this a violation of what LDD3 said about opaqueness?
Does dma_alloc_coherent always use __get_free_pages internally? Does this mean the region is potentially always over-allocated (since first function takes bytes but second function allocates in units of pages)?
Is there a way to setup a single streaming mapping for multiple consecutive pages obtained from call to __get_free_pages? dma_map_page applies to only single pages.
No, the address returned is a virtual address, otherwise you wouldn't be able to access it from kernel space. It's dma_handle which represents the physical address, but it's opaque. You need to use virt_to_phys on the address it returns and then pass this to remap_pfn_range.
I don't believe it does (it's likely to be platform dependent though), but it does allocate pages. If you want smaller amounts of memory for DMA you should use dma_pool_create and then allocate regions from there.
You can use dma_map_single instead of dma_map_page.
I'd recommend consulting DMA-API.txt for more details on some of this stuff.
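A minimal sketch of the first point, assuming the mmap file operation of a character device and a cpu_addr previously returned by dma_alloc_coherent() (the names are illustrative):

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* dma_handle is opaque, so derive the PFN from the kernel
     * virtual address returned by dma_alloc_coherent() instead. */
    unsigned long pfn = virt_to_phys(cpu_addr) >> PAGE_SHIFT;
    size_t size = vma->vm_end - vma->vm_start;

    return remap_pfn_range(vma, vma->vm_start, pfn, size,
                           vma->vm_page_prot);
}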