I want to understand how a CPU works and so I want to know how it communicates with a PCIe card.
Which instructions does the CPU use to initialize a PCIe port and than read and write to it?
For example OUT or MOV.
A CPU mainly communicates with PCIe cards through memory ranges they expose. This memory may be small for network or sound cards, and very large for graphics cards. Integrated GPUs have also have their own tiny memory but share most of the main memory. Most other cards also have read/write access to main memory.
To set up the PCIe device, the configuration space is written to. On x86, the BIOS or bootloader will provide the location of this data. PCI devices are connecting in a tree which may include hubs and bridges on larger computers and this can be shown in lspci -t. Thunderbolt can even connect to external devices. This is why the OS needs to recursively "probe" the tree to find PCI devices and configure them.
Synchronization uses interrupts and ring buffers. The device can send a prenegotiated interrupt to the CPU when it's done doing work. The CPU writes work to a ring buffer. It then writes another memory location that contains the head pointer. This memory location is located on the device so it can listen to writes there and wake up when there is work to do.
Most of the interaction for modern devices will use MOV instead of OUT. The I/O ports concept is very old and not very suitable for the massive amount of data on modern systems. Having devices expose their functionality as a type of memory instead of a separate mechanism allows vectorized variants of MOV to move 32 bytes or similar at a time. With graphics card and modern network cards supporting offload, they can also use their own hardware to write results back to main memory when instructed to do so. The CPU can then read the results when it's free later, again using MOV.
Before this memory access works, the OS will need to set up the memory mapping properly. The memory mapping is set in the PCI configuration space as BARs. On the CPU side it is set up in the page tables. CPUs usually have caches to keep data locally because access to RAM is slower. This causes a problem when the data needs to get to a PCI device, so the OS will set certain memory as write-through or even uncacheable so this is ensured.
The word BAR is often marketed by GPU vendors. What they are selling is the ability to map a larger region of memory at a time. Without that, OSes have been just unmapping and reinitializing by remapping a limited window of memory at a time. This exemplifies the importance of MOV accessing PCIe devices.
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has neon instructions, and an assembly code implementation of memcpy using neon instructions, using aligned on cache boundary (32 bytes) will achieve 300 MB/s rates or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma_coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.
The other day I was reading an article where the author was talking about DMA, and how it helps copy packets across the PCI bus into memory, without the CPU being involved.
Then it says:
The only overhead is that about once a millisecond, the CPU needs to wake up and tell the driver which packet buffers are free.
This part I don't quite understand -- why would the CPU tell driver about available buffers and how exactly this works? Any link/reference would be greatly appreciated.
Thanks.
Once the driver's transmit(), etc. function is called, the hardware "owns" the memory. Without the behavior you describe, that memory would be leaked. So the DMA subsystem informs the driver / relevant subsystem that the hardware is "finished" accessing the memory. At that point it can be reclaimed for use by someone else.
In linux kernel, given a process and its virtual memory space, is there a way to find the memory regions that are mapped for DMA (Direct Memory Access)? Maybe from the flags of its vma_area_struct?
Thanks
Well, you could find out which pages are locked.
But the fact that a page is locked does not necessarily imply that it is for DMA.
If the DMA mappings are created by your driver, it is much easier to implement a proper book keeping instead of looking for DMA regions after-the-fact.
I am writing a device driver in Linux for which I need to implement DMA.
It is clear that DMA buffers can be allocated by a call to pci_alloc_consistent(). But how can we write commands to those buffers from user level?
Tasks include writing values to specific registers, how are these implemented using DMA commands?
I believe you can write with DMA through I/O operations that you may access through a GNU C library . You must use system calls such as ioperm or iopl and run as root to gain access to DMA registers. At least thats how one gains access to IO space which may be used for DMA access. Though I may not answer the question completely, hopefully this points you in a good direction.