Linux - Streaming DMA - Explicit flush/invalidate - caching

The documentation on the Streaming DMA API mentions that in order to ensure consistency, the cache needs to be flushed before a buffer is DMA-mapped to the device, and invalidated after it is unmapped from the device.
However, I am confused about whether the flush and invalidate need to be performed explicitly. That is, do dma_map_single() & dma_sync_single_for_device() already take care of flushing the cachelines, or does the driver developer need to call some function to explicitly flush the cachelines of the DMA buffer? The same goes for dma_unmap_single() & dma_sync_single_for_cpu(): do these two functions automatically invalidate the DMA-buffer cachelines?
I skimmed through some existing drivers that use streaming dma and I can't see any explicit calls to flush or invalidate the cachelines.
I also went through the kernel source code, and it seems that the above-mentioned functions all 'invalidate' the cachelines in their architecture-specific implementations, which further adds to my confusion. E.g., in arch/arm64/mm/cache.S:
SYM_FUNC_START_PI(__dma_map_area)
add x1, x0, x1
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
SYM_FUNC_END_PI(__dma_map_area)
Can someone please clarify this? Thanks.

So, based on the comments received and some more findings, I thought I would answer this question myself for others with similar queries. The following is specific to the ARM64 architecture; other architectures may have a slightly different implementation.
When using the Streaming DMA API, one does NOT have to explicitly flush or invalidate the cachelines. The functions dma_map_single(), dma_sync_single_for_device(), dma_unmap_single(), and dma_sync_single_for_cpu() take care of that for you. E.g., dma_map_single() and dma_sync_single_for_device() both end up calling the architecture-dependent function __dma_map_area:
ENTRY(__dma_map_area)
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
ENDPIPROC(__dma_map_area)
In this case, if the direction specified is DMA_FROM_DEVICE, then the cachelines are invalidated (because data must have come from the device to memory and the cachelines need to be invalidated so that any read from CPU will fetch the new data from memory). If direction is DMA_TO_DEVICE/BIDIRECTIONAL then a flush operation is performed (because data could have been written by the CPU and so the cached data needs to be flushed to the memory for valid data to be written to the device).
NOTE that the 'clean' in __dma_clean_area is ARM's nomenclature for cache flush.
The same goes for dma_unmap_single() & dma_sync_single_for_cpu(), which end up calling __dma_unmap_area(), which invalidates the cachelines if the direction specified is not DMA_TO_DEVICE:
ENTRY(__dma_unmap_area)
cmp w2, #DMA_TO_DEVICE
b.ne __dma_inv_area
ret
ENDPIPROC(__dma_unmap_area)
dma_map_single() and dma_unmap_single() are expensive operations, since they also involve additional page mapping/unmapping work, so if the same buffer and direction are reused across transfers it is better to map once and use dma_sync_single_for_cpu() and dma_sync_single_for_device() to pass ownership back and forth.
On a side note, for my case, using Streaming DMA resulted in ~10X faster read operations compared to Coherent DMA. However, the user code gets a little more complicated because you need to ensure that the memory is not accessed by the CPU while the buffer is mapped to the device (or that the sync operations are called before/after CPU access).
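
For reference, here is a minimal sketch of that pattern, assuming a hypothetical driver with a device pointer dev and a driver-allocated buffer buf (the names, buffer size and error handling are illustrative, not taken from any particular driver):

#include <linux/dma-mapping.h>

#define BUF_SIZE 4096   /* placeholder buffer size */

static int do_rx_transfer(struct device *dev, void *buf)
{
        dma_addr_t handle;

        /* Map once: for DMA_FROM_DEVICE this invalidates the buffer's
         * cachelines (via __dma_map_area on arm64). */
        handle = dma_map_single(dev, buf, BUF_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... hand 'handle' to the device and wait for the transfer ... */

        /* Give the buffer back to the CPU: the cachelines are invalidated
         * again so CPU reads fetch the freshly DMA'd data from memory. */
        dma_sync_single_for_cpu(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);
        /* ... CPU processes buf ... */

        /* Hand ownership back to the device for the next transfer instead
         * of unmapping and remapping every time. */
        dma_sync_single_for_device(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);

        /* When the buffer is no longer needed: */
        dma_unmap_single(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);
        return 0;
}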

Related

Why does DSB not flush the cache?

I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use __DSB() instruction that should write the cached data into the RAM. After enabling D-cache back, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
Your ARM system I think has multiple coherency domains for microcontroller vs. DSP? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush cache, what they mean is presumably that it orders memory / cache operations including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing cache means to write data from cache to memory. Memory access is slow.
The L1 cache on this class of core is on the order of tens of KB (e.g. 32KB). You don't want to move a whole 32KB from cache into memory for no reason. There might also be an L2 cache, which is easily 512KB-1MB (or even more); you really don't want to move a whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the caches themselves. There is simply no justification for flushing everything.
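
Along those lines, a hedged sketch of cleaning only the lines that cover the TX buffer (rather than the whole D-cache) might look like the following; clean_tx_buffer, tx_buf and len are illustrative names, and the 32-byte figure assumes the Cortex-M7's cache-line size:

#include <stdint.h>
#include "stm32h7xx.h"   /* CMSIS cache maintenance helpers */

/* Clean (write back) only the cachelines covering the TX buffer. */
static void clean_tx_buffer(const uint8_t *tx_buf, uint32_t len)
{
        /* Round the range out to 32-byte cache-line boundaries. */
        uintptr_t start = (uintptr_t)tx_buf & ~(uintptr_t)31u;
        int32_t   size  = (int32_t)(((uintptr_t)tx_buf + len - start + 31u) & ~(uintptr_t)31u);

        SCB_CleanDCache_by_Addr((uint32_t *)start, size);
        __DSB();   /* order the maintenance before the write that kicks the MAC */
}

/* Usage: clean_tx_buffer(buf, len); then call HAL_ETH_Transmit() as before. */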

STM32H7 MPU shareable memory attribute and strongly ordered memory type

I am confused by some of the attributes of the STM32H7 MPU.
I've read several documents: STM32H7 reference and programming manual, STMicro application note on MPM, etc...
I've understood that shareable is exactly equivalent to non-cacheable (at least on a single core STM32H7). Is it correct?
I need to define an MPU region for a QSPI Flash memory. A document from Microchip (reference TB3179) indicates that the QSPI memory should be configured as Strongly Ordered. I don't really understand why?
Question: I've understood that shareable is exactly equivalent to non-cacheable (at least on a single core STM32H7). Is it correct?
Here's an ST guide to MPU configuration:
https://www.st.com/content/st_com/en/support/learning/stm32-education/stm32-moocs/STM32_MPU_tips.html
If some area is Cacheable and Shareable, only instruction cache is used in STM32F7/H7
As STM32 [F7 and H7] microcontrollers don't contain any hardware
feature for keeping data coherent, setting a region as Shareable
means that data cache is not used in the region. If the region is not
shareable, data cache can be used, but data coherency between bus
masters need to be ensured by software.
Shareable on STM32H7 seems to be implicitly synonymous with non-cached access when INSTRUCTION_ACCESS_DISABLED (Execute Never, code execution disabled).
Furthermore,
https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5468/shareability-memory-attribute
The sharability attribute tells the processor it must do whatever
is necessary to allow that data to be shared. What that really
means depends on the features of a particular processor.
On a processor with multi-CPU hardware cache coherency; the
shareability attribute is a signal to engage the cache coherency logic.
For example A57 can maintain cache-coherency of shareable data within
the cluster and between clusters if connected via a coherent
interconnect.
On a processor without hardware cache coherency, such as Cortex-A8, the only way to share the data is to push it out of the
cache as you guessed. On A8 shareable, cacheable memory ends up
being treated as un-cached.
Someone, please correct me if I'm wrong - it's so hard to come by definitive and concise statements on the topic.
Question: I need to define an MPU region for a QSPI Flash memory.
QSPI memory should be configured as Strongly Ordered. I don't really understand why?
The MPU guide above claims at least two points: prevent speculative access and prevent writes from being fragmented (e.g. interrupted by reading operations).
Speculative memory read may cause high latency or even system error
when performed on external memories like SDRAM, or Quad-SPI.
External memories even don't need to be connected to the microcontroller,
but its memory range is accessible by speculative read because by
default, its memory region is set as Normal.
Speculative access is never made to Strongly Ordered and Device memory
areas.
Strongly Ordered memory type is used in memories which need to have each write be a single transaction
For Strongly Ordered memory region CPU waits for the end of memory access instruction.
Finally, I suspect that alignment can be a requirement from the memory side, which is adequately captured by a memory type that enforces aligned read/write access.
https://developer.arm.com/documentation/ddi0489/d/memory-system/axim-interface/memory-system-implications-for-axi-accesses
However, Device and Strongly-ordered memory are always Non-cacheable.
Also, any unaligned access to Device or Strongly-ordered memory
generates alignment UsageFault and therefore does not cause any AXI
transfer. This means that the access examples given in this chapter
never show unaligned accesses to Device or Strongly-ordered memory.
UsageFault: without explicit configuration, UsageFault defaults to calling the HardFault handler. Differentiated error handling needs to be enabled in the SCB System Handler Control and State Register first:
SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk // will also be set by HAL_MPU_Enable()
| SCB_SHCSR_BUSFAULTENA_Msk
| SCB_SHCSR_USGFAULTENA_Msk;
UsageFault handlers can evaluate UsageFault status register (UFSR) described in https://www.keil.com/appnotes/files/apnt209.pdf.
printf("UFSR : 0x%4x\n", (SCB->CFSR >> 16) & 0xFFFF);

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia describes the steps of out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
Similar information can be found in the "Computer Organization and Design" book:
To make programs behave as if they were running on a simple in-order
pipeline, the instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked, and the
commit unit is required to write results to registers and memory in
program fetch order. This conservative mode is called in-order
commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if instructions are executed out of order, the results of their execution are held in the reorder buffer and then committed to memory/registers in a deterministic order.
On the other hand, it is a known fact that modern CPUs can reorder memory operations for performance reasons (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?
TL:DR: memory ordering is not the same thing as out of order execution. It happens even on in-order pipelined CPUs.
In-order commit is necessary[1] for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after it having already retired. The cardinal rule of out-of-order execution is don't break single-threaded code. If you allowed out-of-order commit (retirement) without any other mechanism, you could have a page fault happen while some later instructions had already executed once, and/or while some earlier instructions hadn't executed yet. This would make restarting execution after handling a page fault impossible the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA. i.e. stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (LoadStore reordering isn't allowed (nor is StoreStore or LoadLoad), only StoreLoad reordering).
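
As an illustration of that StoreLoad case, here is a hedged C11 litmus-test sketch (not from the answer above); on x86 both relaxed stores can still sit in the store buffers while the following loads read cache, so r1 == 0 && r2 == 0 can be observed if you run it enough times:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;            /* the two shared flags */
static int r1, r2;                 /* per-thread observations */

static void *writer_x(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);  /* store ... */
        r1 = atomic_load_explicit(&y, memory_order_relaxed); /* ... then load */
        return NULL;
}

static void *writer_y(void *arg)
{
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* r1 == 0 && r2 == 0 means each load "passed" the other thread's
         * store: the stores were still in the store buffers while the
         * loads read from cache. Run in a loop to actually observe it. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
}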
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

Flush cache to DRAM

I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.
I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.
The problem I'm having is that it takes some time for the writes to show up in DRAM and I presume it's stuck in dcache. There's a bunch of flush_cache_* calls defined but none of them are exported, which is a clue to me that I'm barking up the wrong tree...
As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.
In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?
Thanks.
ARM processors typically have both an I/D cache and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid stalling the CPU while a write completes.
To be generic, you can flush the D-cache and the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.
static inline void dcache_clean(void)
{
        const int zero = 0;

        /* Test-and-clean loop: clean the entire D-cache -> push dirty
         * lines to external memory. */
        __asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
                        "   bne 1b\n" ::: "cc");

        /* Drain the write buffer. */
        __asm volatile ("mcr p15, 0, %0, c7, c10, 4" :: "r" (zero));
}
You may need more if you have an L2 cache.
To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,
proc-arm926.S
cache-v7.S
cache-v6.S
etc
These routines are either called directly or looked up via a CPU info structure holding function pointers to the appropriate routine for the detected CPU and configuration, depending on whether the kernel is special-purpose for a single CPU or multi-purpose like an Ubuntu distribution.
To answer the question specifically for your situation, we need to know the L2 cache, write-buffered memory, and CPU architecture specifics, maybe including silicon revisions for errata. Another tactic is to avoid this completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed out immediately. Depending on your memory access pattern, either solution is valid. You may wish to cache if the memory only needs to be synchronized at some checkpoint (vsync/hsync for video, etc.).
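
As a hedged sketch of that dma_alloc_XXX() route (the function, buffer name and 1 MiB size are illustrative placeholders):

#include <linux/dma-mapping.h>

#define SHARED_BUF_SIZE (1 << 20)   /* 1 MiB, placeholder */

static void *shared_virt;           /* kernel virtual address */
static dma_addr_t shared_bus;       /* address to give to the device / PL */

static int setup_shared_region(struct device *dev)
{
        /* The returned memory is mapped uncached/unbuffered (or is
         * hardware-coherent), so CPU writes reach DRAM without any
         * explicit cache maintenance. */
        shared_virt = dma_alloc_coherent(dev, SHARED_BUF_SIZE,
                                         &shared_bus, GFP_KERNEL);
        if (!shared_virt)
                return -ENOMEM;

        /* Hand 'shared_bus' to the programmable logic; the kernel (or a
         * userspace mapping of this buffer) uses 'shared_virt'. */
        return 0;
}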
I hit the exact same problem, on Zynq. I finally got L2 flushed/invalidated with:
#include <asm/outercache.h>
outer_cache.flush_range(start,size);
outer_cache.inv_range(start,size);
start is a kernel virtual space pointer. You also need to flush L1 to L2:
__cpuc_flush_dcache_area(start,size);
I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
It seems any suggestions on the 'net that I found assume the device to be "inside" the L2 cache coherency domain, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed.
(For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)
I'm not familiar with Zynq, but you essentially have two options that really work:
either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
or mark the memory you map as device memory (or other non-cacheable if you don't care about gather, reorder and early write acknowledge) and use a DSB after any write that should be seen.
If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble: when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that's where it will all end up, not in DRAM.
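
A hedged sketch of the second option, as it might look in the driver's mmap handler (shared_phys and shared_size are placeholders for the memmap-reserved region):

#include <linux/fs.h>
#include <linux/mm.h>

static phys_addr_t shared_phys;     /* filled in from the memmap= reservation */
static size_t shared_size;

static int shared_mmap(struct file *filp, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > shared_size)
                return -EINVAL;

        /* Uncached page attributes: CPU writes are not held in the
         * D-cache, at the cost of slower CPU access to this window. */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        return io_remap_pfn_range(vma, vma->vm_start,
                                  shared_phys >> PAGE_SHIFT,
                                  size, vma->vm_page_prot);
}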

Invalidating the CPU's cache

When my program performs a load operation with acquire semantics / a store operation with release semantics, or perhaps a full fence, it invalidates the CPU's cache.
My question is this: which part of the cache is actually invalidated? Only the cache line that held the variable I used acquire/release on? Or perhaps the entire cache is invalidated (L1 + L2 + L3... and so on)? Is there a difference in this respect when I use acquire/release semantics versus a full fence?
When you perform a load without fences or mutexes, the loaded value could potentially come from anywhere, i.e., caches, registers (by way of compiler optimizations), or RAM... but from your question, you already knew this.
In most mutex implementations, when you acquire a mutex, a fence is always applied, either explicitly (e.g., mfence, barrier, etc.) or implicitly (e.g., lock prefix to lock the bus on x86). This causes the cache-lines of all caches on the path to be invalidated.
Note that the entire cache isn't invalidated, just the respective cache-lines for the memory location. This also includes the lines for the mutex (which is usually implemented as a value in memory).
Of course, there are architecture-specific details, but this is how it works in general.
Also note that this isn't the only reason for invalidating caches, as there may be operations on one CPU that would need caches on another one to be invalidated. Doing a google search for "cache coherence protocols" will provide you with a lot of information on this subject.
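
As an illustration of that per-line behaviour, here is a hedged C11 sketch of an acquire/release pairing; the names are illustrative and the comments describe the general model, not any specific microarchitecture:

#include <stdatomic.h>

static int payload;                 /* ordinary, non-atomic data */
static atomic_int ready;            /* the flag guarded by acquire/release */

static void producer(void)
{
        payload = 42;                                        /* plain store */
        /* Release: all earlier writes must be visible before the flag is. */
        atomic_store_explicit(&ready, 1, memory_order_release);
}

static void consumer(void)
{
        /* Acquire: once ready == 1 is observed, cache coherency guarantees
         * the line holding 'payload' is fetched fresh as needed; no
         * whole-cache invalidation is involved. */
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
                ;                                            /* spin */

        int value = payload;                                 /* sees 42 */
        (void)value;
}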

Resources