I'm in the middle of writing a framebuffer driver for an SPI-connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150KB. Because of the way kmalloc allocates the buffer, ksize reports that far more memory is being used - 256KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA safe. As I want the tx buffer to be about 16KB, can I allocate that buffer within the kmalloced video buffer and still be DMA safe?
This could be considered premature optimisation but there's so much spare space within the video buffer it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement of "DMA safe" is appropriate alignment - to CPU cacheline size I believe... - but shouldn't a page alignment be ok for this?
As an aside, I'm not sure if tx and rx can point to the same place. The spi.h header is unclear too (explicitly unclear actually). Given that the rx buffer will never be more than a few bytes, it would be silly to make trouble by trying to find out!
The answer appears to be yes with provisos. (Specifically that "it's more complicated than that")
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc) then you may DMA to/from that memory using the addresses returned from those routines. The underlying implication is that a page aligned buffer within kmalloc, even spanning multiple pages, will be DMA safe as the underlying physical memory is guaranteed to be contiguous and a page aligned buffer is guaranteed to be on a cache line boundary.
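The layout arithmetic from the question can be checked in plain userspace C. This is only a sketch under stated assumptions: a 4096-byte page, and the simplification that a large kmalloc request is served from a power-of-two size class. The `sketch_*` names are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size; the kernel's PAGE_SIZE is architecture dependent. */
#define SKETCH_PAGE_SIZE 4096u

/* Round x up to the next multiple of the page size, like PAGE_ALIGN(). */
static size_t sketch_page_align(size_t x)
{
    return (x + SKETCH_PAGE_SIZE - 1) & ~((size_t)SKETCH_PAGE_SIZE - 1);
}

/*
 * Round x up to the next power of two. This approximates the size class
 * a large kmalloc() request lands in (an assumption about the allocator,
 * not a guarantee).
 */
static size_t sketch_roundup_pow2(size_t x)
{
    size_t p = 1;

    assert(x > 0);
    while (p < x)
        p <<= 1;
    return p;
}

/* Byte offset of the tx buffer inside the combined allocation. */
static size_t sketch_txbuf_offset(size_t videosize)
{
    return sketch_page_align(videosize);
}
```

With a 150KB video buffer and a 16KB tx buffer, both `sketch_roundup_pow2(videosize)` and `sketch_roundup_pow2(sketch_page_align(videosize) + txbufsize)` come out at 256KB under these assumptions, and the tx offset is a multiple of the page size.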
One proviso is whether the device is capable of driving the full bus width (eg: ISA). Thus, the physical address of the memory must be within the dma_mask of the device.
Another is cache coherency requirements. These operate at the granularity of the cache line width. To prevent two separate memory regions from sharing one cache line, the memory for DMA must begin exactly on a cache line boundary and end exactly on one. Given that the cache line size may not be known, it is recommended (DMA API documentation) to only map virtual regions that begin and end on page boundaries (as these are guaranteed also to be cache line boundaries as stated above).
A DMA driver can use dma_alloc_coherent() to allocate DMA-able space in this case to guarantee that the DMA region is uncacheable. As this may be expensive, a streaming method also exists - for one way communication - where coherency is limited to cache flushes on write. Use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the spi routines will do it for me. I could use dma_map_single myself along with either unmap or dma_sync_single_for_cpu() to keep everything in sync. I won't bother at the moment though - performance tweaking after the driver works is a better strategy.
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space
Related
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly implementation of memcpy using NEON, with buffers aligned on a cache line boundary (32 bytes), will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.
I am developing a Linux kernel driver on 3.4. The purpose of this driver is to provide a mmap interface to Userspace from a buffer allocated in an other kernel module likely using kzalloc() (more details below). The pointer provided by mmap must point to the first address of this buffer.
I get the physical address from virt_to_phys(). I give this address right shifted by PAGE_SHIFT to remap_pfn_range() in my mmap fops call.
It is working for now, but it looks to me as though I am not doing things properly, because nothing ensures that my buffer is at the start of a page (correct me if I am wrong). Maybe mmap()ing is not the right solution? I have already read chapter 15 of LDD3, but maybe I am missing something?
Details:
The buffer is in fact a shared memory region allocated by the remoteproc module. This region is used within an asymmetric multiprocessing design (OMAP4). I can get this buffer thanks to the rproc_da_to_va() call. That is why there is no way to use something like get_free_pages().
Regards
Kev
Yes, you're correct: there is no guarantee that the allocated memory is at the beginning of a page. And no simple way for you to both guarantee that and to make it truly shared memory.
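To see why the page offset matters, here is a small userspace sketch of the arithmetic behind passing `virt_to_phys() >> PAGE_SHIFT` to remap_pfn_range(). It assumes 4 KiB pages; the `sketch_*` helpers and the example address used below are hypothetical, not kernel API.

```c
#include <stdint.h>

/* Assumed 4 KiB pages; the real PAGE_SHIFT is architecture dependent. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Page frame number: what is left after shifting the offset bits away. */
static uint64_t sketch_pfn(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/* Offset of the buffer within its first page; lost by the shift above. */
static uint64_t sketch_offset_in_page(uint64_t phys)
{
    return phys & (SKETCH_PAGE_SIZE - 1);
}

/* Number of whole pages that must be mapped to cover [phys, phys + len). */
static uint64_t sketch_pages_spanned(uint64_t phys, uint64_t len)
{
    uint64_t first = sketch_pfn(phys);
    uint64_t last = sketch_pfn(phys + len - 1);

    return last - first + 1;
}
```

For a hypothetical buffer at physical address 0x80001230 with a length of one page, the mapping starts 0x230 bytes before the buffer and has to cover two pages, so user space sees everything else that lives in those pages as well.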
Obviously you could (a) copy the data from the kzalloc'd address to a newly allocated page and insert that into the mmap'ing process' virtual address space, but then it's not shared with the original datum created by the other kernel module.
You could also (b) map the actual page being allocated by the other module into the process' memory map but it's not guaranteed to be on a page boundary and you'd also be sharing whatever other kernel data happened to reside in that page (which is both a security issue and a potential source of kernel data corruption by the user-space process into which you're sharing the page).
I suppose you could (c) modify the memory manager to return every piece of allocated data at the beginning of a page. This would work, but then every time a driver wants to allocate 12 bytes for some small structure, it will in fact be allocating 4K bytes (or whatever your page size is). That's going to waste a huge amount of memory.
There is simply no way to trick the processor into making the memory appear to be at two different offsets within a page. It's not physically possible.
Your best bet is probably to (d) modify the other driver to allocate the specific bits of data that you want to make shared in a way that ensures alignment on a page boundary (i.e. something you write to replace kzalloc).
I am reading Linux Kernel Development by Robert Love. I don't understand this paragraph about the bio structure:
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>. This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By
allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
What exactly does "in flight (active)" mean?
"As a list of segments" -- are we talking about this segmentation?
What does "By allowing the buffers ... in memory" mean?
Block devices are devices that deal with a chunk (512, 1024 bytes) of data during an I/O transaction. "struct bio" is available for block I/O operations from kernel space. This structure is commonly used in block device driver development.
Q1) What exactly does "in flight (active)" mean?
Block devices usually carry a file system meant for storing files. Whenever a user-space application initiates a file I/O operation (read, write), the kernel in turn initiates a sequence of block I/O operations through the file system. The "struct bio" keeps track of the block I/O transactions (initiated by the user app) that are to be processed. That is what is referred to here as in flight/active.
Q2) "As a list of segments" -- are we talking about this segmentation?
Memory buffers are required by the kernel to hold data to/from Block device.
In the kernel there are two possibilities for how the memory is allocated:
Virtually contiguous, physically contiguous (using kmalloc(); provides good performance but is limited in size)
Virtually contiguous, physically non-contiguous (using vmalloc(); for large memory requirements)
Here a segment indicates the first type, i.e. contiguous physical memory, which is used for block I/O transfer. A list of segments indicates a set of such physically contiguous memory regions. Note that the list elements are not contiguous with one another.
Q3) What does "By allowing the buffers ... in memory" mean?
Scatter-gather is a feature that allows data transfer between a device and multiple non-contiguous memory locations in a single shot (one read/write transaction). Here "struct bio" keeps a record of the multiple segments that are to be processed. Each segment is a contiguous memory region, whereas the segments are not contiguous with one another. "struct bio" gives the kernel the capability to perform scatter-gather I/O.
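The same idea is visible from user space through writev() and readv(), which take a list of separate buffers (struct iovec) for one transfer. The sketch below is only a userspace analogue and does not touch struct bio; the function name `sketch_scatter_gather` and the pipe used as the "device" are assumptions for illustration.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather two separate source buffers into one write on a pipe, then
 * scatter the resulting contiguous byte stream into two separate
 * destination buffers with one read.
 * Returns 0 if all bytes made the round trip, -1 otherwise.
 */
static int sketch_scatter_gather(char *out_a, size_t a_len,
                                 char *out_b, size_t b_len)
{
    char part1[] = "hello ";
    char part2[] = "world";
    struct iovec gather[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };
    struct iovec scatter[2] = {
        { .iov_base = out_a, .iov_len = a_len },
        { .iov_base = out_b, .iov_len = b_len },
    };
    ssize_t total = (ssize_t)(strlen(part1) + strlen(part2));
    ssize_t n;
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    /* One call, two non-contiguous source segments. */
    n = writev(fds[1], gather, 2);
    if (n == total)
        /* One call, two non-contiguous destination segments. */
        n = readv(fds[0], scatter, 2);

    close(fds[0]);
    close(fds[1]);
    return n == total ? 0 : -1;
}
```

The destination buffers together need room for all 11 bytes. Note that the segment boundaries on the read side do not have to match those on the write side: the data in between is one contiguous stream.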
"In flight" means an operation that has been requested, but hasn't completed yet.
"Segment" here means a range of memory to be read or written, a contiguous piece of the data to be transferred as part of the operation.
"Scatter/gather I/O" means the following: scatter operations take a contiguous range of data on disk and distribute pieces of it into memory; gather operations take separate ranges of data in memory and write them contiguously to disk. (Replace "disk" by some suitable device in the preceding.) Some I/O machinery is able to do this in one operation (and this is getting more common).
1) "In flight" means "in progress"
2) No
3) Not quite sure :)
I want to set up a DMA mapping for a memory buffer allocated outside my control. dma_map_single appears to be the right API to use, but my HW has a restriction: the mapping must not cross some power-of-two boundary, for example 1K. The buffer being mapped is always smaller than the boundary value but otherwise variable in size. So it looks like DMA pools may not work, since they need a fixed size, even though the "allocation" part is sort of what I need.
Should I just keep doing dma_map_single and check if mapping meets my requirement and release mapping if it does not? Can this cause same mapping to potentially be returned causing a never ending search? If so, I could hang on to the unfit mappings till a fit one is found and then release all the unfit mappings in one shot. These however don't sound like good ideas.
Does anyone have other/better ideas?
Thanks.
If you can't guarantee that the buffer you are passed meets your criteria, you may need to allocate an auxiliary buffer and copy to/from that buffer before you DMA. On platforms without an IOMMU or other address translation hardware (eg classic x86, ARM, etc), the DMA mapping operation is really just converting to a physical address. So if you unmap and try again with the same buffer, you'll always get back the same DMA address.
On most (all?) other platforms that do have an IOMMU, the translation is still done on chunks >= PAGE_SIZE. In other words, if you're on a platform with 4K pages, and you do DMA mapping on a buffer at 0xABCDExxx, you'll always get a DMA address like 0xFGHIJxxx where the low part of the address "xxx" stays the same. (This is because the IOMMU works like a normal MMU and only looks up the page translation, and leaves the low 12 or whatever bits alone)
So in essentially all cases on all platforms, you can't use the DMA API to fix up the alignment of the buffer you get passed. As I said, I think the only alternative if the buffers you get passed in don't meet your alignment requirements is to use a bounce buffer. The DMA pool API is a fine way to allocate these bounce buffers -- it's no problem if you sometimes need a smaller buffer; it's fine to leave some of the memory you get back unused.
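Deciding whether a given mapping needs the bounce buffer comes down to one check on the DMA address and length. A minimal sketch, assuming the boundary is a power of two as in the question; `sketch_crosses_boundary` is a hypothetical helper name, not part of the DMA API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the range [dma_addr, dma_addr + len) crosses a
 * multiple of `boundary`. `boundary` must be a power of two and
 * `len` must be non-zero.
 */
static bool sketch_crosses_boundary(uint64_t dma_addr, uint64_t len,
                                    uint64_t boundary)
{
    uint64_t mask;

    assert(len > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    /* Compare the boundary-sized block of the first and last byte. */
    mask = ~(boundary - 1);
    return (dma_addr & mask) != ((dma_addr + len - 1) & mask);
}
```

Because the low bits of the address survive the mapping, as described above, the result for a 1K boundary is the same whether the check is run on the original buffer address or on the returned DMA address, so it can be done before mapping at all.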
I am testing the throughput of an interface on Linux. I am using DMA to do the data transfer. DMA needs a contiguous memory region, but kmalloc is unable to allocate more than 1MB. Is there any other way to create a big buffer of up to 100MB?
I thought kmalloc couldn't allocate over 128kB, how did you get it to allocate 1MB ?
Anyway, assuming you're working on a freshly booted system, you can reserve up to 2048 contiguous pages. Pages are generally 4k, so this makes 8MB.
__get_free_pages(__GFP_DMA, get_order(2048 * PAGE_SIZE));
However, if you need more memory, you should do the allocation at boot-time.
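When sizing such requests it helps to remember that the kernel's get_order() takes a size in bytes and returns the power-of-two page order that covers it. The following is a userspace re-implementation for illustration only, assuming 4 KiB pages; `sketch_get_order` is a hypothetical name and this is not the kernel's own code.

```c
#include <stddef.h>

/* Assumed 4 KiB pages; the real PAGE_SHIFT is architecture dependent. */
#define SKETCH_PAGE_SHIFT 12u

/*
 * Smallest order such that (1 << order) pages hold `size` bytes.
 * Mirrors the documented behaviour of get_order() for size > 0.
 */
static unsigned int sketch_get_order(size_t size)
{
    size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
    size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under these assumptions a 2048-byte request is order 0 (a single page), while 2048 pages of 4 KiB (8MB) is order 11. Whether the page allocator will actually serve an order that large depends on the kernel's configured maximum order and on memory fragmentation.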
If you are writing a driver, this can be achieved with the alloc_bootmem_* functions.
If you are writing a module, you have to pass mem= argument to your kernel and later use ioremap.
For example, if you have 2GB, you can pass mem=1GB to forbid the kernel from using the upper 1GB, and later call ioremap(0x40000000, 0x40000000) to get access to the upper 1GB, just for you.
But you know, you should just split your huge buffer into many small ones, it'll be much easier and much more like real-life applications.