V4L2 MMAPed memory only bufferable - linux-kernel

I use a Freescale i.MX6Q board from Phytec. It runs a Yocto/Poky-based OS with kernel 3.19.5 and some i.MX IPU, v4l2 and media bus drivers.
My issue is that I want to accelerate a UYVY conversion, and I have been trying out various techniques (multithreading, OpenCV OCL, NEON, ...). A standard integer-based conversion using 4 threads takes 4 to 8 ms for a 640x480 image and 17 to 25 ms for 1920x1080, but only if I first copy the v4l2 buffer to a userspace buffer (copy time not included above). If I convert directly from the v4l2 buffer into a userspace buffer, it takes about 4 to 8 times as long, which suggests that this buffer might be uncached (L2). So I dug further and found that the mmapped buffers are allocated by the vb2 DMA routines, which use dma_alloc_coherent, which in turn allocates buffers with just the BUFFERABLE flag. From my understanding this means that the cache is not used, right?
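A minimal sketch of the copy-first approach described above, assuming packed UYVY input (byte order U0 Y0 V0 Y1), RGB24 output and fixed-point BT.601 limited-range coefficients. None of these details are stated in the question, and the function names are hypothetical. The frame is first copied out of the mmapped buffer with memcpy, and the conversion loop then reads only from the ordinary userspace copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clamp an intermediate value to the 0..255 range of one colour channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one Y/U/V triple to RGB with fixed-point BT.601 coefficients
 * (assumption: limited-range video levels, Y in 16..235). */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t *rgb)
{
    int c = 298 * ((int)y - 16) + 128;   /* luma term plus rounding bias */
    int d = (int)u - 128;
    int e = (int)v - 128;

    rgb[0] = clamp_u8((c + 409 * e) / 256);
    rgb[1] = clamp_u8((c - 100 * d - 208 * e) / 256);
    rgb[2] = clamp_u8((c + 516 * d) / 256);
}

/* Convert a packed UYVY frame to RGB24.
 * mapped  : the (possibly uncached) mmapped capture buffer
 * scratch : ordinary userspace memory, width * height * 2 bytes
 * rgb     : output, width * height * 3 bytes
 * The frame is copied to scratch first so that the per-pixel loop
 * only touches normally cached memory. */
static void uyvy_to_rgb24(const uint8_t *mapped, uint8_t *scratch,
                          uint8_t *rgb, size_t width, size_t height)
{
    size_t bytes = width * height * 2;

    assert(width % 2 == 0);              /* UYVY stores pixels in pairs */
    memcpy(scratch, mapped, bytes);

    for (size_t i = 0; i < bytes; i += 4) {
        uint8_t u  = scratch[i];
        uint8_t y0 = scratch[i + 1];
        uint8_t v  = scratch[i + 2];
        uint8_t y1 = scratch[i + 3];

        yuv_to_rgb(y0, u, v, rgb);       /* both pixels share U and V */
        yuv_to_rgb(y1, u, v, rgb + 3);
        rgb += 6;
    }
}
```

Splitting the rows across threads, as in the 4-thread version mentioned above, would only change how the loop range is divided.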
The thing is that, as soon as the buffer is dequeued, the hardware has finished all previous operations on it and will not write to it again, and neither will I, so it makes no sense to leave the buffer uncached until I queue it back.
Since you cannot force caching from userspace, I thought of using user pointers as buffers. But as far as I can see, although the driver says it supports user pointers, the IPU DMA (IDMAC) does not support scatter-gather, which means that the memory needs to be physically contiguous (and page/cache aligned). That in turn is a problem, since only drivers can allocate contiguous buffers. The only driver/API I can recall that does just that is cmem.
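To make the alignment half of that requirement concrete, here is a small userspace sketch using posix_memalign. The helper names are hypothetical. It yields a page-aligned, page-multiple buffer, but the pages are only virtually contiguous, so on its own it does not satisfy a DMA engine without scatter-gather.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
static size_t round_up_to_page(size_t len, size_t page)
{
    return (len + page - 1) / page * page;
}

/* Allocate a page-aligned buffer whose size is a multiple of the page size.
 * Returns NULL on failure; the caller releases the buffer with free().
 * Note: the result is contiguous in virtual memory only. Nothing here
 * makes the underlying physical pages contiguous. */
static void *alloc_page_aligned(size_t len, size_t *alloc_len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = NULL;

    assert(alloc_len != NULL);
    if (page <= 0)
        return NULL;

    *alloc_len = round_up_to_page(len, (size_t)page);
    if (posix_memalign(&p, (size_t)page, *alloc_len) != 0)
        return NULL;
    return p;
}
```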
So is there some other way to use the cache, of which I have not thought yet or that I overlooked?
Best regards

Related

Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?

What really happens on modern PC hardware booted in 16-bit legacy BIOS MBR mode when you store a byte such as '1' (0x31) into the VGA text (mode 03) framebuffer at physical linear address B8000? How slow is a mov [es:di], eax store with the MTRR for that region set to UC? (Experimental testing on one Kaby Lake iGPU laptop indicates that clflushopt on WC was roughly the same speed as UC for VGA memory. But without clflushopt, mov stores to WC memory never leave the CPU and don't update the screen at all, running super fast.)
If it's not an SMI for every store, is there any way to approximate this cost on a chunk of WB memory in user-space, for performance experiments without actually rebooting into real mode? (e.g. using a BSS page as a pretend framebuffer that doesn't actually display anywhere).
The corresponding font glyph appears on screen in the next refresh, but is hardware scan-out really reading that ASCII char from VRAM (or DRAM for an iGPU) and mapping to bitmap font glyphs on the fly? Or is there some software interception on each store or once per vblank so the real hardware only has to handle a bitmapped framebuffer?
Legacy BIOS booting is well known to use System Management Mode (SMM) to emulate USB kbd/mouse as PS/2 devices. I'm wondering if it's also used for the VGA text mode framebuffer. I assume it is used for VGA I/O ports for mode-setting but it's plausible that a text framebuffer could be supported by hardware. However, most computers spend all their time in graphics mode so leaving out HW support for text mode seems like something vendors might want to do. (OTOH this blog suggests that a homebrew verilog VGA controller can implement text mode fairly simply.)
I'm specifically interested in systems using the iGPU in Intel Skylake, but would be interested in earlier / later iGPUs from Intel and AMD, and new or old discrete GPUs.
(Including vendors other than AMD and NVidia; there are some Skylake motherboards with PCI slots, not PCIe. If modern GPU firmware drivers do emulate text mode, presumably there are some old PCI video cards with hardware VGA text mode. And maybe such a card could make stores just be a PCI transaction instead of an SMI.)
My own desktop is an i7-6700k in an Asus Z170 Pro Gaming mobo, no add-on cards just iGPU with a 1920x1200 monitor on the DVI-D output. I don't know the details of the Kaby Lake i5-7300HQ system @Eldan is testing on, only the CPU model.
I found Phoenix BIOS's patent US20120159520 from 2011, Emulating legacy video using uefi. Instead of requiring video hardware vendors to supply both UEFI and native 16-bit real mode option-ROM drivers, they propose a real-mode VGA driver (int 10h functions and so on) that calls a vendor-supplied UEFI video driver via SMM hooks.
Abstract
[...] The generic video option ROM notifies a generic video SMM driver of the request for video services. Such notification may be performed using a software system management interrupt (SMI). Upon notification, the generic video SMM driver notifies a third party UEFI video driver of the request for video services. The third party video driver provides the requested video services to the operating system. In this way, a third party UEFI graphics driver may support a wide variety of operating systems, even those that do not natively support the UEFI display protocols.
Much of the description covers handling int 10h calls and stuff like that which already obviously trap through the IVT, thus can easily run custom code that triggers an SMI on purpose. The relevant part is what they describe for direct stores into the text-mode framebuffer which need to work even for code that doesn't trigger any software or hardware interrupts. (Other than HW triggering SMI on such stores, which they say they can use if supported.)
Text Buffer Support
[0066] In certain embodiments, applications may manipulate the VGA's text buffer directly. In such an embodiment, generic video SMM driver 130 support this in one of two ways, depending on whether the hardware provides SMI trapping on read/write access to the 740 KB-768 KB memory region (where the text buffers are located).
[0067] When SMI trapping is available, the hardware generates an SMI on each read or write access. Using the trap address of the SMI trap, the exact text column and row may be calculated and the corresponding row and column in the virtual text screen accessed. Alternately, normal memory is enabled for this region and, using a periodic SMI, generic video SMM driver 130 scans for changes in the emulated hardware text buffer and updates the corresponding virtual text screen maintained by the video driver. In both cases, when a change is detected, the character is redrawn on the virtual text screen.
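The two strategies in the quoted paragraphs can be sketched in plain C. This is only an illustration, not the patent's implementation. It assumes the standard colour text layout of mode 03 (buffer at B8000, 80 columns by 25 rows, two bytes per cell with the character first and the attribute second), and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TEXT_BASE 0xB8000u   /* assumed start of the mode 03 text buffer */
#define TEXT_COLS 80u
#define TEXT_ROWS 25u

struct text_cell_pos {
    unsigned row;
    unsigned col;
    bool is_attribute;       /* true when the attribute byte was hit */
};

/* Strategy 1: turn a trapped physical address into a row and column.
 * Returns false when the address lies outside the visible text page. */
static bool decode_trap_address(uint32_t addr, struct text_cell_pos *pos)
{
    uint32_t offset, cell;

    assert(pos != NULL);
    if (addr < TEXT_BASE)
        return false;

    offset = addr - TEXT_BASE;
    cell = offset / 2;                       /* two bytes per cell */
    if (cell >= TEXT_COLS * TEXT_ROWS)
        return false;

    pos->row = cell / TEXT_COLS;
    pos->col = cell % TEXT_COLS;
    pos->is_attribute = (offset & 1u) != 0;
    return true;
}

/* Strategy 2: periodic scan. Compare the live buffer with a shadow copy,
 * update the shadow, and report how many cells would need redrawing. */
static size_t scan_for_changes(const uint16_t *live, uint16_t *shadow)
{
    size_t changed = 0;

    for (size_t i = 0; i < TEXT_COLS * TEXT_ROWS; i++) {
        if (live[i] != shadow[i]) {
            shadow[i] = live[i];
            changed++;
        }
    }
    return changed;
}
```

The first routine does a constant amount of work per store but needs a trap on every access. The second touches all 2000 cells on every scan regardless of how many stores happened.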
This is just one BIOS vendor's patent, and doesn't tell us which way most hardware actually works, or if other vendors do different things. It does essentially confirm that some hardware exists which can trap on stores in that range, though. (Unless that's just a hypothetical possibility that they decided to cover in their patent.)
For the use-case I have in mind, trapping only on screen refresh would be vastly faster than trapping on every store so I'm curious which hardware / firmware works which way.
Motivation for this question
Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core - repeatedly storing new digits for an ASCII text counter into the same few bytes of video RAM.
I tested a version of the code in 32-bit user-space under Linux, on WB memory, hoping to approximate the situation with movnti and different ways of getting the CPU to sync its WC buffer to video RAM after each store (or perhaps occasionally in a timer interrupt). But this is not realistic if the real-mode bootloader situation isn't just storing to DRAM, but instead triggering an SMI.
On WB memory, flushing movnti stores with a lock xor byte [esp], 0 is somewhat faster than flushing with clflushopt. But @Eldan reports no speed improvement for those on VGA memory after programming an MTRR to make it WC. (And the same speed as for the original doing normal stores, indicating that by default the VGA framebuffer was UC. Some older BIOSes had an option to make VGA memory WC, which they called USWC = Uncached Speculative Write Combining.)
It's not a real-world problem so I'm not looking for actual workarounds; although it would be interesting to know if manually storing pixel bytes into a VGA graphics mode could be much faster.
Summary
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory? So we can easily profile with perf for performance counters.
If different BIOSes and/or hardware use different strategies, what are those strategies? (I don't want details, just a high level like "SMI every vblank to sync the VGA framebuffer to the actual hardware framebuffer")
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do? I'm guessing an actual PCIe write transaction would be slower than waiting for a store to hit DRAM, but that a PCIe write would be cheaper than an SMI on every store. A ballpark / order of magnitude comparison would be interesting.
These questions are all highly related, but I can split this up if there isn't as much overlap as I expect.
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
For video cards, I very much doubt it. Video card manufacturers have had the "get pixel data from char+attribute" logic built into hardware since the 1980s (it predates VGA and hasn't changed much since CGA), and just cut&paste that logic into each newer design without caring much about it.
For things that are not video cards at all (e.g. remote system management tools using LAN) I don't know but suspect not (often they use a special management CPU rather than the main CPU/s so that it works even if the computer is turned "off").
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory?
If you're not in user-space, you can change MTRRs (on all CPUs - MTRRs must match and there's a special sequence involved) to make an area of RAM "uncached"; or use PAT in the page tables (much easier than messing with MTRRs, especially if you're using paging anyway, but slightly different behavior due to still needing cache coherency). If you are in user-space then you will have to rely on whatever the OS/kernel provides, and (depending on which OS it is) the OS/kernel may not provide any way to do this at all.
However; even if you find a way to make (an area of) RAM uncached it still won't be very similar, because you'll be writing directly to something attached to a memory controller built into the CPU (that the CPU can write to extremely quickly) instead of talking to something at the other end of a PCI link (that will have higher latency and lower bandwidth from the CPU's side). Even for integrated video (where it's technically the same RAM chips in the end) writes to VRAM go through a very different path (subject to remapping/GART/paging in the video card, affected by a "write mode" VGA register, affected by bit/plane mask VGA registers, etc).
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do?
For writes from CPU to VRAM; typically integrated video is significantly faster than discrete cards (at least for plain writes from CPU to linear frame buffers where none of the VGA's "write logic" is involved).
For extremely rough ballpark estimates; I'd expect a single write to RAM to be around 150 cycles and a single write to PCI to be close to 1000 cycles. For SMI I'd expect a few hundred cycles of latency before SMI arrives at CPU, then the cost of CPU pipeline flush, then about 500 cycles to save CPU's state (and same loading state on the return path); then the firmware's code would have to find the cause of the SMI (another few hundred cycles?) before it could know it was a write to VRAM and not something else; then it'd have to examine the saved CPU state and find and decode the instruction that made the write (because it can't know what data was being written, if it was a byte/word/dword write, etc) while taking into account previous CPU state (which mode CPU was in, code size, etc) and keeping track of how emulating the instruction affects the future CPU state (advancing RIP, etc - don't forget that they'll be emulating every instruction that can cause a write, including things like XADD, etc). Next it would have to analyze the state of (emulated) VGA registers (write mode, write mask, plane enable, whatever controls which 64 KiB bank is mapped into the legacy area, font height, ...). Basically; for SMI emulation of a write to text mode frame buffer; I'd expect it to take tens of thousands of cycles before the firmware's code overlooks a minor but important detail buried among a huge amount of complexity, causing it to do the wrong thing and be unusably broken.
Other Notes
I found Phoenix BIOS's patent US20120159520 from 2011, Emulating legacy video using uefi.
I doubt this was ever implemented, because I doubt it can ever work. There are far too many (common and obscure) things you can do with the legacy interfaces (e.g. detect vertical refresh, set up non-standard video modes like "mode X", fiddle with "display start" to implement smooth scrolling and/or page flipping, use "CRTC info" in VBE to alter video timings, etc) that aren't supported by UEFI and can't be done via a third party video driver for UEFI.
Instead, video card manufacturers didn't bother providing UEFI drivers for about 10 years and UEFI firmware used the legacy interface to emulate UEFI services (often breaking secure boot while they were at it); until almost everything was UEFI anyway.
I assume it (SMM) is used for VGA I/O ports for mode-setting.
I assume not. The only thing vaguely related to video that I'd suspect SMM may be used for is controlling the brightness of the screen's backlight in laptops (especially for older laptops, and especially for "lid open/close events") during early boot (before OS takes over).
.. leaving out HW support for text mode seems like something vendors might want to do
I still believe that the (eventual, after the already too long "hybrid BIOS+UEFI" transition phase) removal of 30+ years of accumulated legacy mess (A20, VGA, PS/2, PIT, PIC, ...) from hardware is one of the main reasons hardware manufacturers (Intel) are/have been pushing for UEFI adoption.
Reading through various modern Intel CPU and Platform Controller Hub (PCH) datasheets, it doesn't appear that the necessary hardware is implemented. There doesn't seem to be any way to generate an SMI (System Management Interrupt) in response to processor accesses of the VGA frame buffer (physical addresses 0xA0000 - 0xBFFFF).
The memory controller in the CPU will route accesses to the VGA frame buffer to either the integrated graphics controller, the PCI Express port connected directly to the CPU, or the DMI interface connecting the CPU to the PCH. While it's possible to route parts of the VGA frame buffer separately, this appears only meant to support a separate MDA (Monochrome Display Adapter) device. The integrated graphics controller is not well documented so it's possible that it can be configured to generate an SMI on VGA frame buffer accesses, but this seems unlikely. In any case, it wouldn't work with discrete graphics.
Intel PCHs also don't seem to have any support for generating SMIs in response to VGA frame buffer accesses. This would be the most natural place for it, as the PCH already has support for generating SMIs in response to I/O accesses to the keyboard controller, IDE controller and other legacy devices. It's possible that there's some undocumented feature that does this, but it's not included in the lists of possible SMI sources given in the PCH datasheets.
Theoretically, it would be possible for a motherboard manufacturer to connect a fake VGA device to the PCH through a PCI Express port and then generate SMIs using a PCH GPIO pin. However, I'm not sure this will work in practice. By the time the CPU gets the SMI it could have moved on to executing other instructions and it wouldn't be possible to examine the CPU state at the time of the frame buffer access.
(A similar problem happened with SoundBlaster 16 emulation on the SoundBlaster Live. It would generate a PCI SERR# when the legacy SoundBlaster ports were accessed, which would generate an NMI on the CPU. Unfortunately the emulation would break on many Pentium 4 motherboards because the NMI would arrive on the next or subsequent instruction.)

Memory copy is taking more time on GPU compared to CPU

I have source and destination pointers for the image to copy. When I run the copy code on the CPU, it takes 2 ms.
Now I ran the code on OpenCL with:
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,src_ptr,errcode_ret)
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,dst_ptr,errcode_ret)
and wrote a kernel with a global work size of (W, H), so each work-item copies one pixel. It takes about 20 ms.
Can someone please help me with how to do a memory copy efficiently in OpenCL when we have image pointers to global memory? What is the proper work-group size to use for this?
Can you help clarify what you're trying to accomplish? Are you trying to compare the time it takes to memcpy a host buffer to the time it takes to copy a device buffer using a GPU kernel?
If so, try allocating the buffer without the CL_MEM_USE_HOST_PTR flag. From the first response here it seems like some implementations map that buffer to system memory instead of device memory, which could slow down the copy kernel.
how to efficiently do memory copy on open cl when we have image pointers to global memory
The efficient way is to use memcpy() on the host pointers. IOW use the CPU.
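As a rough illustration of that host-side copy, here is a minimal sketch. The helper name and the stride parameters are assumptions, since the question does not say whether the image rows are padded. A tightly packed image is copied with a single memcpy, and a padded one row by row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy an image on the CPU.
 * row_bytes : number of meaningful bytes per row (width * bytes per pixel)
 * *_stride  : distance in bytes between the starts of consecutive rows */
static void copy_image(uint8_t *dst, size_t dst_stride,
                       const uint8_t *src, size_t src_stride,
                       size_t row_bytes, size_t height)
{
    assert(row_bytes <= dst_stride && row_bytes <= src_stride);

    if (dst_stride == row_bytes && src_stride == row_bytes) {
        /* No padding on either side: the whole image is one block. */
        memcpy(dst, src, row_bytes * height);
        return;
    }

    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}
```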
when we use CL_MEM_USE_HOST_PTR, GPU can access the image directly from global memory instead of copying from global memory
That's not strictly true. It's true for integrated GPUs (if the host_ptr memory pointer is properly aligned). Discrete GPUs will still copy host memory to their own memory over the PCI express bus. If you read the documentation for clCreateBuffer, it says:
CL_MEM_USE_HOST_PTR ... OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
Discrete GPUs cannot directly "work" on host memory. Even if they could, it would be so slow as to be pointless.
In fact using CL_MEM_USE_HOST_PTR with a discrete GPU may result in worse performance, because the GPU will have to keep the host copy in sync with its own copy, which will result in a lot of PCIe transfers. CL_MEM_USE_HOST_PTR only makes sense with integrated GPUs to save unnecessary transfers and memory copies.
Generally the way you work with GPUs is to minimize memory transfers, so you create buffers once (with clCreateBuffer), then launch the kernels you need on them, and then either transfer the result back to the host (via clEnqueueReadBuffer) or display it with OpenGL interop. You'll have to clarify what you're doing if you want more useful advice.

Can the Rx/Tx Packet Buffer size be changed dynamically on a Linux NIC driver?

At the moment, the transmit and receive packet size is defined by a macro
#define PKT_BUF_SZ (VLAN_ETH_FRAME_LEN + NET_IP_ALIGN + 4)
So PKT_BUF_SZ comes to around 1524 bytes, which means the NIC I have can handle incoming packets from the network that are <= 1524 bytes. Anything bigger than that causes the system to crash or, worse, reboot. I am using Linux kernel 2.6.32 and RHEL 6.0, and a custom FPGA NIC.
Is there a way to change PKT_BUF_SZ dynamically by getting the size of the incoming packet from the NIC? Will it add to the overhead? Should the hardware drop the packets before they reach the driver?
Any help/suggestion will be much appreciated.
This isn't something that can be answered without knowledge of the specific controller. They all work differently in details.
Some Broadcom NICs, for example, have different-sized pools of buffers from which the controller will select an appropriate one based on the frame size: for example, a pool of small (256-byte) buffers, a pool of standard-size (1536 or so) buffers, and a pool of jumbo buffers.
Some Intel NICs have allowed a list of fixed-size buffers together with a maximum frame size, and the NIC will then pull as many consecutive buffers as needed (not sure Linux ever supported this use though -- it's much more complicated for software to handle).
But the most common model that most NICs use (and in fact, I believe all of the commercial ones can be used this way): they expect an entire frame to fit in a single buffer, and your single buffer size needs to accommodate the largest frame you will receive.
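For that single-buffer model, the sizing arithmetic behind the PKT_BUF_SZ macro from the question can be reproduced in userspace as a sketch. The constants below are copied here for illustration and are assumed to match the kernel headers of that era (NET_IP_ALIGN taken as 2, which is what makes the total 1524). The helper rx_buf_size_for_mtu is hypothetical, and the source does not say what the trailing 4 bytes in the macro are for.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, mirroring the kernel's Ethernet/VLAN definitions. */
#define ETH_HLEN            14    /* destination + source + ethertype */
#define VLAN_HLEN           4     /* 802.1Q tag */
#define ETH_DATA_LEN        1500  /* default MTU */
#define VLAN_ETH_FRAME_LEN  (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)  /* 1518 */
#define NET_IP_ALIGN        2     /* assumption: 2 on this platform */

/* The macro exactly as written in the question. */
#define PKT_BUF_SZ (VLAN_ETH_FRAME_LEN + NET_IP_ALIGN + 4)

static_assert(PKT_BUF_SZ == 1524, "matches the 1524 bytes quoted in the question");

/* Hypothetical helper: the buffer size needed so that the largest frame
 * for a given MTU still fits in one receive buffer, using the same
 * formula as PKT_BUF_SZ with the MTU as a variable. */
static size_t rx_buf_size_for_mtu(size_t mtu)
{
    return ETH_HLEN + VLAN_HLEN + mtu + NET_IP_ALIGN + 4;
}
```

With this formula a 9000-byte jumbo MTU would need 9024-byte buffers. The buffers would have to be reallocated when the MTU changes, not per packet, and the NIC would have to be told the new size.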
Given that your NIC is a custom FPGA one, only its designers can advise you on the specifics you're asking. If Linux is crashing when larger packets come through, then most likely either your allocated buffer size is not as large as you are telling the NIC it is (leading to overflow), or the NIC has a bug that is causing it to write into some other memory area.

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly implementation of memcpy that uses NEON instructions on buffers aligned to a cache-line boundary (32 bytes) will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.

Is a buffer within kmalloc also a DMA safe buffer?

I'm in the middle of writing a framebuffer driver for an SPI connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150KB. Given the way kmalloc is allocating the buffer, ksize reports that way more memory is being used - 256KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA safe. As I want the tx buffer to be about 16KB, can I allocate that buffer within the kmalloced video buffer and still be DMA safe?
This could be considered premature optimisation but there's so much spare space within the video buffer it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement of "DMA safe" is appropriate alignment - to CPU cacheline size I believe... - but shouldn't a page alignment be ok for this?
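The size arithmetic behind this idea can be checked with a short userspace sketch. It assumes a 4096-byte page, 150KB meaning 150 * 1024 bytes, and that large kmalloc requests are rounded up to a power-of-two size class (which would explain the 256KB that ksize reports). All names are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Userspace stand-in for the kernel's PAGE_ALIGN(). */
static size_t page_align(size_t n)
{
    return (n + EX_PAGE_SIZE - 1) & ~((size_t)EX_PAGE_SIZE - 1);
}

/* Assumed model of the allocator: round up to the next power of two. */
static size_t pow2_bucket(size_t n)
{
    size_t b = 1;

    assert(n > 0);
    while (b < n)
        b <<= 1;
    return b;
}

/* Offset of the tx buffer inside the combined allocation, matching
 * txbuf = (u8 *)kptr + PAGE_ALIGN(videosize). */
static size_t txbuf_offset(size_t videosize)
{
    return page_align(videosize);
}
```

Under these assumptions the 150KB video buffer rounds up to 155648 bytes, and adding a 16KB tx buffer gives 172032 bytes, which still lands in the same 262144-byte size class as the video buffer alone.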
As an aside, I'm not sure if tx and rx can point to the same place. The spi.h header is unclear too (explicitly unclear actually). Given that the rx buffer will never be more than a few bytes, it would be silly to make trouble by trying to find out!
The answer appears to be yes with provisos. (Specifically that "it's more complicated than that")
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc) then you may DMA to/from that memory using the addresses returned from those routines. The underlying implication is that a page aligned buffer within kmalloc, even spanning multiple pages, will be DMA safe as the underlying physical memory is guaranteed to be contiguous and a page aligned buffer is guaranteed to be on a cache line boundary.
One proviso is whether the device is capable of driving the full bus width (eg: ISA). Thus, the physical address of the memory must be within the dma_mask of the device.
Another is cache coherency requirements. These operate at the granularity of the cache line width. To prevent two separate memory regions from sharing one cache line, the memory for DMA must begin exactly on a cache line boundary and end exactly on one. Given that this may not be known, it is recommended (DMA API documentation) to only map virtual regions that begin and end on page boundaries (as these are guaranteed also to be cache line boundaries as stated above).
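The begin-and-end rule can be written down as a small check. This is a userspace sketch with hypothetical names, and the cache line size is passed in as a parameter because it depends on the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the region [start, start + len) begins and ends exactly on
 * cache line boundaries, so it shares no line with neighbouring data.
 * line must be a power of two. */
static bool region_is_line_exclusive(uintptr_t start, size_t len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return len != 0 &&
           (start & (line - 1)) == 0 &&
           (len & (line - 1)) == 0;
}

/* Round a length up to the next multiple of the cache line size. */
static size_t round_up_to_line(size_t len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (len + line - 1) & ~(line - 1);
}
```

A page-aligned, page-sized region passes this check for any line size that divides the page size, which is why mapping whole pages is the safe default.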
A DMA driver can use dma_alloc_coherent() to allocate DMA-able space in this case to guarantee that the DMA region is uncacheable. As this may be expensive, a streaming method also exists - for one way communication - where coherency is limited to cache flushes on write. Use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the spi routines will do it for me. I could use dma_map_single myself along with either unmap or dma_sync_single_for_cpu() to keep everything in sync. I won't bother at the moment though - performance tweaking after the driver works is a better strategy.
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space

Resources