Probably a stupid question for most people who know DMA and caches... I just know that a cache stores data somewhere closer to where you access it, so you don't have to spend as much time on the I/O.
But what about DMA? Does it let you access that main memory with less delay?
Could someone explain the differences, both concepts, or why I'm just confused?
DMA is a hardware device that can move data to/from memory without using CPU instructions.
For instance, a hardware device (let's say, your PCI sound device) wants audio to play back. You can either:
Write a word at a time via CPU mov instructions.
Configure the DMA device. You give it a start address, a destination, and the number of bytes to copy. The transfer then occurs while the CPU does something else instead of spoon-feeding the audio device (a sketch of this follows below).
DMA can be very complex (scatter gather, etc), and varies by bus type and system.
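As a hedged illustration of the second option, here is a minimal sketch that programs a hypothetical memory-mapped DMA controller. Every register name, address, and bit here is invented for illustration; a real controller's programming model will differ:

#include <stdint.h>

/* Hypothetical DMA controller registers (layout invented for illustration). */
#define DMA_BASE      0x40001000u
#define DMA_SRC       (*(volatile uint32_t *)(DMA_BASE + 0x00))  /* source address      */
#define DMA_DST       (*(volatile uint32_t *)(DMA_BASE + 0x04))  /* destination address */
#define DMA_LEN       (*(volatile uint32_t *)(DMA_BASE + 0x08))  /* bytes to copy       */
#define DMA_CTRL      (*(volatile uint32_t *)(DMA_BASE + 0x0C))  /* control/status      */
#define DMA_CTRL_GO   (1u << 0)
#define DMA_CTRL_BUSY (1u << 1)

static void dma_copy(uint32_t src, uint32_t dst, uint32_t nbytes)
{
    DMA_SRC  = src;                       /* where the data comes from             */
    DMA_DST  = dst;                       /* e.g. the sound device's FIFO address  */
    DMA_LEN  = nbytes;                    /* how much to move                      */
    DMA_CTRL = DMA_CTRL_GO;               /* kick off the transfer                 */

    /* The CPU is now free to do other work; in practice you would take a
       completion interrupt. The polling loop below is just the simplest
       possible way to end the sketch. */
    while (DMA_CTRL & DMA_CTRL_BUSY) { }
}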
I agree fully with the first answer, but there are a few common additions...
On most DMA hardware you can also set it up to do memory-to-memory transfers; there are not always external devices involved. Also, depending on the system, you may or may not need to sync the CPU cache in software before (or after) the transfer, since the data the DMA moves into/from memory may change without the CPU cache's knowledge.
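As one concrete, hedged example of that software cache maintenance: on a Cortex-M7 part with CMSIS, the sequence around a transfer might look roughly like this (the buffer names, sizes, and device header are assumptions):

#include <stdint.h>
#include "stm32h7xx.h"   /* any CMSIS device header that provides the SCB cache helpers */

#define BUF_SIZE 512
/* Buffers assumed to be cache-line (32-byte) aligned. */
static uint8_t tx_buf[BUF_SIZE] __attribute__((aligned(32)));
static uint8_t rx_buf[BUF_SIZE] __attribute__((aligned(32)));

void start_transfer(void)
{
    /* Before the DMA reads tx_buf: write any dirty cache lines back to memory. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, BUF_SIZE);

    /* ... start the DMA transfer here (device-specific) ... */
}

void transfer_complete(void)
{
    /* Before the CPU reads rx_buf: discard stale cache lines so the
       data the DMA just wrote is fetched from memory. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, BUF_SIZE);

    /* ... now it is safe to read rx_buf from the CPU ... */
}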
The benefit of doing any DMA is that the CPU(s) is/are able to do other things simultaneously.
Of course when the CPU also needs to access the memory, only one can gain access and the other must wait.
Mem to mem DMA is often used in embedded systems to increase performance, or may be vital to be able to access some parts of the memory at all.
To answer the question, DMA and CPU-cache are totally different things and not comparable.
I know it's a bit late, but answering this question may help someone like me. Agreeing with the above answers, I think the question was really about the cache.
So yes, a cache does store information somewhere closer to the processor; this can include the results of earlier computations. Whenever data is found in the cache (a cache hit), the value is used directly; when it is not found (a cache miss), the processor goes on to fetch or recompute the required value. Peripheral devices (SD cards, USB devices, etc.) can also access the underlying data, which is why on startup we usually invalidate the cache so that the cache lines are clean. We also flush the cache on startup so that any cached data is written back to main memory, after which we proceed to reset or initialize the cache.
DMA (Direct Memory Access): yes, it does let you access main memory. But I think the better way to put it is that it lets devices access system memory without going through the processor. @Ronnie and @Yann Ramin were both correct in that DMA can be a hardware device, so it can be used by your serial peripheral to access system memory, but it can also be used for memory-to-memory transfers between two cores.
You can read up further on DMA on Wikipedia, about the modes in which DMA can access system memory. I'll explain them simply:
Burst mode: the DMA controller takes full control of the bus; the CPU is idle during this time. Data is transferred in one burst (as a whole) without interruption.
Cycle stealing mode: here data is transferred one byte at a time; the transfer is slow, but the CPU is not idle.
I am confused by some of the attributes of the STM32H7 MPU.
I've read several documents: the STM32H7 reference and programming manuals, the STMicro application note on the MPU, etc.
I've understood that shareable is exactly equivalent to non-cacheable (at least on a single-core STM32H7). Is it correct?
I need to define an MPU region for a QSPI Flash memory. A document from Microchip (reference TB3179) indicates that the QSPI memory should be configured as Strongly Ordered. I don't really understand why.
Question: I've understood that shareable is exactly equivalent to non-cacheable (at least on a single core STM32H7). Is it correct?
Here's an ST guide to MPU configuration:
https://www.st.com/content/st_com/en/support/learning/stm32-education/stm32-moocs/STM32_MPU_tips.html
If some area is Cacheable and Shareable, only instruction cache is used in STM32F7/H7
As STM32 [F7 and H7] microcontrollers don't contain any hardware feature for keeping data coherent, setting a region as Shareable means that the data cache is not used in the region. If the region is not shareable, the data cache can be used, but data coherency between bus masters needs to be ensured by software.
Shareable on STM32H7 seems to be implicitly synonymous with non-cached access when INSTRUCTION_ACCESS_DISABLED (Execute Never, code execution disabled).
Furthermore,
https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5468/shareability-memory-attribute
The shareability attribute tells the processor it must do whatever is necessary to allow that data to be shared. What that really means depends on the features of a particular processor.
On a processor with multi-CPU hardware cache coherency, the shareability attribute is a signal to engage the cache coherency logic. For example, the A57 can maintain cache coherency of shareable data within the cluster, and between clusters if connected via a coherent interconnect.
On a processor without hardware cache coherency, such as the Cortex-A8, the only way to share the data is to push it out of the cache, as you guessed. On the A8, shareable, cacheable memory ends up being treated as un-cached.
Someone, please correct me if I'm wrong - it's so hard to come by definitive and concise statements on the topic.
Question: I need to define an MPU region for a QSPI Flash memory.
QSPI memory should be configured as Strongly Ordered. I don't really understand why?
The MPU guide above gives at least two reasons: preventing speculative access, and preventing writes from being fragmented (e.g. interrupted by read operations).
Speculative memory reads may cause high latency or even a system error when performed on external memories like SDRAM or Quad-SPI. An external memory doesn't even need to be connected to the microcontroller; its memory range is still accessible by speculative reads because, by default, its memory region is set as Normal.
Speculative accesses are never made to Strongly Ordered and Device memory areas.
The Strongly Ordered memory type is used for memories where each write needs to be a single transaction.
For a Strongly Ordered memory region, the CPU waits for the end of the memory access instruction.
Finally, I suspect that alignment can be a requirement from the memory side which is adequately represented by a memory type that enforces aligned read/write access.
https://developer.arm.com/documentation/ddi0489/d/memory-system/axim-interface/memory-system-implications-for-axi-accesses
However, Device and Strongly-ordered memory are always Non-cacheable. Also, any unaligned access to Device or Strongly-ordered memory generates an alignment UsageFault and therefore does not cause any AXI transfer. This means that the access examples given in this chapter never show unaligned accesses to Device or Strongly-ordered memory.
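Putting those points together, a hedged sketch of such an MPU region using the STM32 HAL could look like the following. The region number and the 64 MB size are assumptions for illustration; 0x90000000 is the memory-mapped QSPI range on the STM32H7, and Strongly Ordered corresponds to TEX=0, C=0, B=0:

#include "stm32h7xx_hal.h"

/* Sketch only: configure the memory-mapped QSPI range as Strongly Ordered. */
void MPU_Config_QSPI(void)
{
    MPU_Region_InitTypeDef r = {0};

    HAL_MPU_Disable();

    r.Enable           = MPU_REGION_ENABLE;
    r.Number           = MPU_REGION_NUMBER0;          /* assumed free region number */
    r.BaseAddress      = 0x90000000;                  /* QSPI memory-mapped region  */
    r.Size             = MPU_REGION_SIZE_64MB;        /* assumed flash size         */
    r.AccessPermission = MPU_REGION_FULL_ACCESS;
    r.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    /* TEX=0, C=0, B=0 -> Strongly Ordered (shareability is implied). */
    r.TypeExtField     = MPU_TEX_LEVEL0;
    r.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
    r.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
    r.IsShareable      = MPU_ACCESS_SHAREABLE;
    r.SubRegionDisable = 0x00;

    HAL_MPU_ConfigRegion(&r);
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);           /* also sets MEMFAULTENA */
}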
UsageFault: without explicit configuration, UsageFault defaults to escalating to the HardFault handler. Differentiated error handling needs to be enabled in the SCB System Handler Control and State Register first:
SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk   // will also be set by HAL_MPU_Enable()
            | SCB_SHCSR_BUSFAULTENA_Msk
            | SCB_SHCSR_USGFAULTENA_Msk;
UsageFault handlers can evaluate the UsageFault status register (UFSR), described in https://www.keil.com/appnotes/files/apnt209.pdf:
printf("UFSR : 0x%04x\n", (unsigned)((SCB->CFSR >> 16) & 0xFFFF));
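A minimal handler along those lines might look like the sketch below (bit positions taken from the UFSR description in the application note above; what you do with each fault is up to you):

#include <stdio.h>
#include "stm32h7xx.h"   /* device header providing SCB */

void UsageFault_Handler(void)
{
    uint32_t ufsr = (SCB->CFSR >> 16) & 0xFFFFu;  /* UFSR is the upper half of CFSR */

    printf("UFSR : 0x%04x\n", (unsigned)ufsr);

    if (ufsr & (1u << 8)) {
        /* UNALIGNED: e.g. an unaligned access to Device/Strongly-ordered memory */
    }
    if (ufsr & (1u << 9)) {
        /* DIVBYZERO */
    }

    while (1) { }  /* sketch: stop here; a real handler would log and recover/reset */
}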
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We also use /dev/mem to map it into userspace, and I've timed memcpy at around 70 MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea), ioremap is not supposed to be used for this purpose, and there are DMA-related APIs that should be used instead. Unfortunately, DMA buffer allocation appears to be totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to work out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory that I can use? Maybe I could write one. I have to figure out whether this processor has NEON.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
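A hedged sketch of such a char-device mmap() method, assuming the physical base and size of the pre-reserved buffer are known to the driver (RESERVED_PHYS and RESERVED_SIZE are placeholders, not values from the question):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define RESERVED_PHYS 0x38000000UL   /* placeholder: physical base of the reserved buffer */
#define RESERVED_SIZE (64UL << 20)   /* placeholder: 64 MB                                */

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > RESERVED_SIZE)
        return -EINVAL;

    /* Note: no pgprot_noncached()/pgprot_writecombine() here, so the
     * userspace mapping keeps the default (cacheable) attributes. */
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations mydrv_fops = {
    .owner = THIS_MODULE,
    .mmap  = mydrv_mmap,
};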
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly memcpy implementation using NEON instructions, with transfers aligned to cache-line boundaries (32 bytes), will achieve rates of 300 MB/s or higher.
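The answer refers to a hand-written assembly memcpy; as a rough, hedged approximation in C with NEON intrinsics (assuming the buffers are 32-byte aligned and the length is a multiple of 32), the idea looks like this:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: copy len bytes, 32 bytes (one cache line) per iteration.
 * Assumes dst, src and len are all multiples of 32 / 32-byte aligned. */
static void neon_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 32) {
        uint8x16_t lo = vld1q_u8(src + i);        /* load 16 bytes  */
        uint8x16_t hi = vld1q_u8(src + i + 16);   /* load next 16   */
        vst1q_u8(dst + i,      lo);               /* store 16 bytes */
        vst1q_u8(dst + i + 16, hi);
    }
}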
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.
Could someone explain what is the difference between cache memory and scratchpad memory? I'm currently learning about computer architecture.
A scratchpad is just that: a place to keep some stuff. Cache is memory you talk through, normally not talk at. A scratchpad is like a post-it note, something you write something on and keep with you. Cache is like paper you send off to someone else with instructions, like a memo.
Cache can be in various places/layers (L1, L2, L3...). Both scratchpad and cache are just SRAM in some chip, with an address bus, a data bus and read/write/etc. control signals (as are many other things in a computer which may or may not be used as addressable RAM). During boot, before the RAM on the far side is initialized (the slower RAM side, the processor being the near side; typically DRAM eventually, if you have a cache, otherwise why have a cache), it may be possible to access the cache as addressable RAM. It depends very much on the system/design though: there may be a control register that enables it to behave as a simple RAM, or there may be a mode, or its normal mode may be such that, so long as you don't address more than the size of the RAM based on its alignment (perhaps a 32K RAM between 32K boundaries), it may never try to evict anything and generate bus cycles on the DRAM/slow/far side of the cache, allowing you to use it as RAM just like a scratchpad.
BUT, the normal use case for a cache is as an ideally invisible pathway to RAM. You don't access the cache RAM using its own addressing; you use the address space of the RAM beyond it, and the cache simply allows the processor to continue without waiting for the slow RAM.
Talking about booting again: think about the kinds of things you need to do when booting, namely bringing up the DRAM controller, which is most definitely a non-trivial thing. Having some on-chip memory allows you, if nothing else, to temporarily have some RAM for a small stack and a few variables. You can, for example, use a compiled language like C, which needs at a minimum some RAM for the stack and variables. Depending on space you can put some program there too, likely running much faster there than from flash. The alternative, having no RAM, likely means writing the DRAM init in assembly using only general-purpose or other registers in the processor, taking a complicated task and making it that much more difficult. Once the main system RAM is up, you may or may not choose to stop using the on-chip (scratchpad) RAM.
I would and do argue that if you want to test the DRAM to see if it is working, you should not use that RAM to test itself: the test program should not run in, nor use, the RAM under test. Scratchpad RAM on chip (or some other RAM in the address space, perhaps video card RAM, for example) can be used for the DRAM test program. Unfortunately lots of folks will use the RAM under test to hold the stack, program, variables and heap of the program doing the test, leaving important parts of the RAM untested other than with one or a small number of patterns.
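For illustration, here is a hedged sketch of the kind of test that can run entirely out of scratchpad/on-chip RAM: a simple address-as-data pass over the DRAM under test. The base address and size are placeholders; the point is only that the code, stack and these few variables live outside the RAM being tested:

#include <stdint.h>

/* Placeholders for the DRAM range under test. */
#define DRAM_BASE  ((volatile uint32_t *)0x80000000u)
#define DRAM_WORDS (16u * 1024u * 1024u)   /* 64 MB / 4 bytes per word */

int dram_address_test(void)
{
    for (uint32_t i = 0; i < DRAM_WORDS; i++)
        DRAM_BASE[i] = i;                  /* write each word's own index */

    for (uint32_t i = 0; i < DRAM_WORDS; i++)
        if (DRAM_BASE[i] != i)             /* read back and compare */
            return -1;                     /* mismatch: DRAM or controller problem */

    return 0;                              /* this pattern passed */
}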
We have an app that requires ~1MB buffers for a hardware device to fill, so we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() because we need to manipulate the buffers and therefore wanted them cached (we flush the cache when needed). One of the manipulations the kernel module does is copying one buffer to another. In timing these copies we see it takes about ~2 ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow, we wrote a standard userspace test app that used malloc() to create 1MB buffers and copied them. The userspace copies took about 0.5 ms, which is about the correct time to move this amount of memory on the processor/memory configuration we are using.
Things we tried: to make sure it wasn't a different memcpy() in kernel space versus user space, we wrote our own NEON-optimized copy, but it made no difference. We changed the buffer size from 100KB to 10MB, and it made no difference. All times were averaged over 10 copies, but were always very consistent. The timing routine used gettimeofday() in userspace.
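For reference, a minimal userspace timing loop of the kind described, reconstructed as a sketch rather than the actual test code (BUF_SIZE and the buffers are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUF_SIZE (1024 * 1024)   /* 1 MB, matching the buffers described above */

int main(void)
{
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    struct timeval t0, t1;

    memset(src, 0xA5, BUF_SIZE);            /* touch the pages first */

    gettimeofday(&t0, NULL);
    for (int i = 0; i < 10; i++)            /* average over 10 copies */
        memcpy(dst, src, BUF_SIZE);
    gettimeofday(&t1, NULL);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("avg copy time: %.3f ms\n", ms / 10.0);

    free(src);
    free(dst);
    return 0;
}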
The only thing we can think of is that the data cache is set up differently for kmalloc()'ed memory than for malloc()'ed memory?
We are working on an i.MX6 ARM with a Linaro kernel.
The kmalloc() memory will be contiguous in physical space. The user-space memory will definitely not be (mlock() may result in something closer to contiguous). If you have several SDRAM chips, it is possible that your memory controller allows pipelining or multiple issued reads/writes to different chips simultaneously. It may even be faster with multiple banks. vmalloc() will not use contiguous pages, so you should be able to write a test that swaps kmalloc() with vmalloc(). If something has changed with the newer ARMs and the cache is not VIVT, the difference in physical addresses could cause cache (aliasing?) effects on some processors.
I do not think that the caches are set up differently for kernel memory versus user memory, at least with 2.6.34 variants, but they may come from different pools. Also, for a memcpy() a large cache is not needed; you just need enough to make sure the SDRAM will burst.
Another issue is peripherals. For instance, a large graphics buffer on one chip may be stealing cycles via DMA. If you can change your machine file or device table to disable as many drivers as possible, this can be eliminated. This, combined with the pipelining, could account for the kind of slow-down observed.
I believe this is a platform issue. If it were strictly Linux, I think one of the millions of users would have encountered it. However, you haven't given a specific version of Linux. It could be an ARM-based issue, so I tagged it as such. I think it is your platform/ARM combination, simply because otherwise others would have observed this. Can you also provide the specific machine file or device table that your design was based upon, and the Linux version?
Question 1:
Where exactly do the internal registers and the internal cache exist? I understand that when a program is loaded into main memory it contains a text section, a stack, a heap and so on. However, are the registers located in a fixed area of main memory, or are they physically on the CPU, not residing in main memory at all? Does this apply to the cache as well?
Question 2:
How exactly does a device controller use direct memory access without using the CPU to schedule/move data between its local buffer and main memory?
Basic answer:
The CPU registers are directly on the CPU. The L1, L2, and L3 caches are often on-chip; however, they may be shared between multiple cores or processors, so they're not always "physically on the CPU." They are never part of main memory either. The general principle is that the closer memory is to the CPU, the faster and more expensive (and thus smaller) it is. Every item in the cache has a particular main memory address associated with it (though the same slot can be associated with different addresses at different times). However, there is no association between registers and main memory at all. That is why, if you use the register keyword in C (not that it's often necessary, since the compiler is usually a better optimizer), you cannot use the & operator on that variable.
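To illustrate that last point, here is a trivial example (not from the original answer); the commented-out line is the one a C compiler rejects:

void example(void)
{
    register int counter = 0;     /* hint: keep this variable in a CPU register */

    counter++;                    /* normal use is fine */

    /* int *p = &counter; */      /* error: taking the address of a 'register'
                                     variable is not allowed, because it may
                                     never exist in addressable memory at all */
}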
The DMA controller executes the transfer directly. The CPU watches the bus so it knows when changes are made "behind its back", which invalidate its cache(s).
Even though the CPU is the central processing unit, it's not the sole "mover and shaker". Devices live on buses, along with CPUs, and also RAM. Modern buses allow devices to communicate with RAM without involving the CPU. Some devices are programmed simply by making changes to pieces of RAM which devices poll. Device drivers may poll pieces of RAM that a device is writing into, but usually a CPU receives an interrupt from the device telling it that there's something in a piece of RAM that's ready to read.
So, in answer to your question 2, the CPU isn't involved in memory transfers across the bus, except inasmuch as cache coherence messages about the invalidation of cache lines are involved. Bear in mind that the scenarios are tricky. The CPU might have modified byte 1 on a cache line when a device decides to modify byte 40. Getting that dirty cache line out of the CPU needs to happen before the device can modify data, but on x86 anyway, that activity is initiated by the bus, not the CPU.