DMA in Linux and CPU-driver interaction - linux-kernel

The other day I was reading an article where the author was talking about DMA, and how it helps copy packets across the PCI bus into memory, without the CPU being involved.
Then it says:
The only overhead is that about once a millisecond, the CPU needs to wake up and tell the driver which packet buffers are free.
This part I don't quite understand -- why would the CPU tell driver about available buffers and how exactly this works? Any link/reference would be greatly appreciated.
Thanks.

Once the driver's transmit(), etc. function is called, the hardware "owns" the memory. Without the behavior you describe, that memory would be leaked. So the DMA subsystem informs the driver / relevant subsystem that the hardware is "finished" accessing the memory. At that point it can be reclaimed for use by someone else.

Related

When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "Pinned or page-locked" memory? Which are the equivalent in OpenCL?

I am just new with this APIs of the Nvidia and some expressions are not so clear for me. I was wondering if somebody can help me to understand when and how to use these CUDA commands in a simply way. To be more precise:
Studing how is possible to speed up some applications with parallel execution of a kernel (with CUDA for example), at some point I was facing the problem of speeding up the interaction Host-Device.
I have some informations, taken surfing on the web, but I am little bit confused.
It clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the cudaHostRegister() command to take some data (already allocated) and pin it avoiding extra copy to take into the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How can I do this previously in this field? After, in the same video in the link, they continue explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which let's the CPU keep working during the memory transfer".
Are the PCIe transaction managed entirely from the CPU? Is there a manager of a bus that takes care of this?
Also partial answers are really appreciated to re-compose the puzzle at the end.
It is also appreciate to have some link about the equivalent APIs in OpenCL.
What is the meaning of "pin the memory"?
It means make the memory page locked. That is telling the operating system virtual memory manager that the memory pages must stay in physical ram so that they can be directly accessed by the GPU across the PCI-express bus.
Why is it so fast? 
In one word, DMA. When the memory is page locked, the GPU DMA engine can directly run the transfer without requiring the host CPU, which reduces overall latency and decreases net transfer times.
Are the PCIe transaction managed entirely from the CPU?
No. See above.
Is there a manager of a bus that takes care of this?
No. The GPU manages the transfers. In this context there is no such thing as a bus master
EDIT: Seems like CUDA treats pinned and page-locked as the same as per the Pinned Host Memory section in this blog written by Mark Harris. This means by answer is moot and the best answer should be taken as is.
I bumped into this question while looking for something else. For all future users, I think #talonmies answers the question perfectly, but I'd like to bring to notice a slight difference between locking and pinning pages - the former ensures that the memory is not pageable but the kernel is free to move it around and the latter ensures that it stays in memory (i.e. non-pageable) but also is mapped to the same address.
Here's a reference to the same.

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has neon instructions, and an assembly code implementation of memcpy using neon instructions, using aligned on cache boundary (32 bytes) will achieve 300 MB/s rates or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma_coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.

Why do we have memory zones in linux?

I was reading this on a page that:
Because of hardware limitations, the kernel cannot treat all pages as identical. Some pages, because of their physical address in memory, cannot be used for certain tasks. Because of this limitation, the kernel divides pages into different zones.
I was wondering about those hardware limitation. Can somebody please explain me those hardware limitation and give an example. As well, is there any software guide from intel explaining this?
Also, I read that virtual memory is divided into two parts 1GB for kernel space and 3GB for user space. Why do we give 1GB space in the virtual space of all processes to kernel? How is it mapped to actual physical pages? Can somebody please point me to a clean text explaining this?
Thanks in advance.
The hardware limitations mostly concern old devies. For example, you have the ZONE_DMA, which is from 0 - 16MB. This is e.g. needed for older ISA Devices, which are not capable of adressing above the 16MB limit. Then you have the ZONE_NORMAL, where most of the kernel operations take place and is adressed permanently into the kernels adress space.
The 1GB and 3GB split is simple. You have virtual adresses here, so for your application, the memory adress always starts at 0x00000000, reserved are the 1st GB of this for kernel stuff. Why this is done is pretty simple: You have the kernel mode and the user mode. In kernel mode you are allowed to use system calls. If you would not have the kernel memory mapped to your virtual adress space, you would have to do a context switch to trap you into kernel mode (context switch: Save current context to memory, load another context from memory -> time consuming). But as kernel-mode operations can take place in the same virtual adress space, you dont need to switch the context to, for example, allocate new memory or do any other system call.
for your second question about 1GB kernel mapping in user space for a processor
kernel is mapped of course for time saving by not having switch. 1 GB is for kernel functionality so that if kernel maps new memory for its functionality kernel can do that. any book on Unix can give you details

Is DMA write/read with respective to memory or device?

Recently I have heard several version of DMA write/read operations, and I didn't have any luck searching google. Some claim it's device-centric while some say it's memory-centric. When people usually say DMA write, is DMA write to the device or device write to memory? Thanks for clarifying the convention.
Let me give you an empirical example !
I have had validated a PCIe based DMA in past. This PCIeDMA was a module of the slave device that was connected over PCIe bus to the x86 host. The slave device had a memory of 800KB. In our case 'DMA Write'meant - The DMA reads the data from the memory of slave device and transfers/writes it to the host memory and vice versa for 'DMA Read'.
Hope this clears your understanding.
At the risk of gross oversimplification . . .
In a DMA system, the device can access memory just as the CPU[s] can. Or you can view the devices as being just another processor.
In your terms, usually it is writing to memory.

DMA vs Cache difference

Probably a stupid question for most that know DMA and caches... I just know cache stores memory to somewhere closer to where you can access so you don't have to spend as much time for the I/O.
But what about DMA? It lets you access that main memory with less delay?
Could someone explain the differences, both, or why I'm just confused?
DMA is a hardware device that can move to/from memory without using CPU instructions.
For instance, a hardware device (lets say, your PCI sound device) wants audio to play back. You can either:
Write a word at a time via a CPU mov instructions.
Configure the DMA device. You give it a start address, a destination, and the number of bytes to copy. The transfer now occurs while the CPU does something else instead of spoon feeding the audio device.
DMA can be very complex (scatter gather, etc), and varies by bus type and system.
I agree fully with the first answer, and there are some common additions...
On most DMA hardwares you can also set it up to do memory to memory transfers - there are not always external devices involved. Also depending on the system you may or may not need to sync the CPU-cache in software before (or after the transfer), since the data the DMA transfers into/from memory may be done without the knowledge of the CPU-cache.
The benefit of doing any DMA is that the CPU(s) is/are able to do other things simultaneously.
Of course when the CPU also needs to access the memory, only one can gain access and the other must wait.
Mem to mem DMA is often used in embedded systems to increase performance, or may be vital to be able to access some parts of the memory at all.
To answer the question, DMA and CPU-cache are totally different things and not comparable.
I know its a bit late but answering this question will help someone like me I guess, Agreeing with the above answers, I think the question was in relation to cache.
So Yes a cache does store information somewhere closer to the memory, this could be the results of earlier computations. Moreover, whenever a data is found in cache (called a cache hit) the value is used directly. when its not found (called a cache-miss), the processor goes on to calculate the required value. Peripheral Devices (SD cards, USBs etc) can also access this data, which is why on startup we usually invalidate cache data so that the cache line is clean. We also flush cache data on startup so that all the cache data is written back to the main memory for cpu to use, after which we proceed to reset or initialize the cache.
DMA (Direct Memory Access), yes it does let you access the main memory. But I think the better definition is, it lets you access the system register, which can only be accessed by the processor. #Ronnie and #Yann Ramin were both correct in that DMA can be a device hardware, so it can be used by your serial peripheral to access system registers, but it can also be used for memory to memory transfers between two cores.
You can read up further on DMA from wikipedia, about the modes in which DMA can access the system memory. I ll explain it simply
Burst mode: DMA takes full control of the bus, CPU is idle during this time. Data is transferred in burst (as a whole) without interruption.
Cycle stealing mode: In this data is transfered one byte at a time, transfer is slow, but CPU is not idle.

Resources