I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.
I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.
The problem I'm having is that it takes some time for the writes to show up in DRAM and I presume it's stuck in dcache. There's a bunch of flush_cache_* calls defined but none of them are exported, which is a clue to me that I'm barking up the wrong tree...
As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.
In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?
Thanks.
The ARM processors typically have both an I/D cache and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid delaying the CPU while waiting for a write to complete.
To be generic, you can flush the D-cache and the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.
static inline void dcache_clean(void)
{
    const int zero = 0;
    /* clean entire D cache -> push to external memory. */
    __asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
                    "   bne 1b\n" ::: "cc");
    /* drain the write buffer */
    __asm volatile ("mcr p15, 0, %0, c7, c10, 4" :: "r" (zero));
}
You may need more if you have an L2 cache.
To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,
proc-arm926.S
cache-v7.S
cache-v6.S
etc
These routines are either called directly or looked up via a CPU info structure containing function pointers to the appropriate routine for the detected CPU and configuration, depending on whether the kernel is special-purpose for a single CPU or multi-purpose like an Ubuntu distribution.
To answer the question specifically for your situation, we need to know the L2 cache, whether the memory is write-buffered, CPU architecture specifics, and maybe even silicon revisions for errata. Another tactic is to avoid this completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed externally immediately. Depending on your memory access pattern, either solution is valid. You may wish to cache if the memory only needs to be synchronized at some checkpoint (vsync/hsync for video, etc.).
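For example, a minimal sketch of the non-cached route inside a driver probe/init path (dev and BUF_SIZE are placeholders here, not names from your code):

#include <linux/dma-mapping.h>

/* Hypothetical example: allocate an uncached/unbuffered (coherent) buffer
 * so CPU writes are pushed out immediately and the PL sees them in DRAM. */
dma_addr_t bus_addr;   /* address to hand to the FPGA logic */
void *cpu_addr = dma_alloc_coherent(dev, BUF_SIZE, &bus_addr, GFP_KERNEL);
if (!cpu_addr)
    return -ENOMEM;
/* ... CPU uses cpu_addr, hardware uses bus_addr ... */
dma_free_coherent(dev, BUF_SIZE, cpu_addr, bus_addr);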
I hit the exact same problem, on Zynq. Finally got L2 flushed/invalidated with:
#include <asm/outercache.h>
outer_cache.flush_range(start,size);
outer_cache.inv_range(start,size);
start is a kernel virtual space pointer. You also need to flush L1 to L2:
__cpuc_flush_dcache_area(start,size);
I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
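For reference, a hedged sketch combining the two steps above into one helper. The outer_cache callbacks vary between kernel versions; in many trees flush_range/inv_range take physical start/end addresses, so check your kernel's asm/outercache.h before relying on this:

#include <linux/io.h>
#include <asm/cacheflush.h>
#include <asm/outercache.h>

/* Hypothetical helper: push a kernel-virtual buffer out to DRAM on Zynq
 * when the PL reads it through an AXI-HP port (i.e. behind the caches). */
static void push_to_dram(void *vaddr, size_t size)
{
    unsigned long pstart = virt_to_phys(vaddr);

    __cpuc_flush_dcache_area(vaddr, size);            /* clean L1 out to L2 */
    outer_cache.flush_range(pstart, pstart + size);   /* clean+invalidate L2 out to DRAM */
}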
It seems any suggestions I found on the 'net assume the device to be "inside" the L2 cache coherency domain, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed.
(For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)
I'm not familiar with Zynq, but you essentially have two options that really work:
either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
or mark the memory you map as device memory (or other non-cacheable if you don't care about gather, reorder and early write acknowledge) and use a DSB after any write that should be seen.
If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble - when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that's where it'll all end up.
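A minimal sketch of the second option in a driver's mmap handler, assuming the reserved region starts at a physical address RESERVED_PHYS (a placeholder). Userspace would then issue a DSB (e.g. __asm__ volatile("dsb" ::: "memory")) after the writes that must be visible:

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical mmap handler: map the reserved region to userspace as
 * non-cacheable so CPU writes reach DRAM without cache maintenance. */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return io_remap_pfn_range(vma, vma->vm_start,
                              RESERVED_PHYS >> PAGE_SHIFT,
                              vma->vm_end - vma->vm_start,
                              vma->vm_page_prot);
}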
Related
The documentation on the Streaming DMA API mentions that in order to ensure consistency, the cache needs to be flushed before dma-mapping to device, and invalidated after unmapping from device.
However, I am confused about whether the flush and invalidate need to be performed explicitly, i.e., do the functions dma_map_single() & dma_sync_single_for_device() already take care of flushing the cachelines, or does the driver developer need to call some function to explicitly flush the cachelines of the DMA buffer? The same goes for dma_unmap_single() & dma_sync_single_for_cpu(): do these two functions automatically invalidate the DMA-buffer cache lines?
I skimmed through some existing drivers that use streaming dma and I can't see any explicit calls to flush or invalidate the cachelines.
I also went through the kernel source code and it seems that the above-mentioned functions all 'invalidate' the cachelines in their architecture-specific implementations, which further adds to my confusion, e.g., in arch/arm64/mm/cache.S:
SYM_FUNC_START_PI(__dma_map_area)
add x1, x0, x1
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
SYM_FUNC_END_PI(__dma_map_area)
Can someone please clarify this? Thanks.
So, based on received comments and some more findings, I thought I would answer this question myself for others with similar queries. The following is specific to the ARM64 architecture. Other architectures may have a slightly different implementation.
When using the Streaming DMA API, one does NOT have to explicitly flush or invalidate the cachelines. The functions dma_map_single(), dma_sync_single_for_device(), dma_unmap_single() and dma_sync_single_for_cpu() take care of that for you. E.g. dma_map_single() and dma_sync_single_for_device() both end up calling the architecture-dependent function __dma_map_area:
ENTRY(__dma_map_area)
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
ENDPIPROC(__dma_map_area)
In this case, if the direction specified is DMA_FROM_DEVICE, then the cachelines are invalidated (because data must have come from the device to memory and the cachelines need to be invalidated so that any read from CPU will fetch the new data from memory). If direction is DMA_TO_DEVICE/BIDIRECTIONAL then a flush operation is performed (because data could have been written by the CPU and so the cached data needs to be flushed to the memory for valid data to be written to the device).
NOTE that the 'clean' in __dma_clean_area is ARM's nomenclature for cache flush.
Same goes for dma_unmap_single() & dma_sync_single_for_cpu() which end up calling __dma_unmap_area() which invalidates the cachelines if the direction specified is not DMA_TO_DEVICE.
ENTRY(__dma_unmap_area)
cmp w2, #DMA_TO_DEVICE
b.ne __dma_inv_area
ret
ENDPIPROC(__dma_unmap_area)
dma_map_single() and dma_unmap_single() are expensive operations since they also include some additional page mapping/unmapping operations so if the direction specified is to remain constant, it is better to use the dma_sync_single_for_cpu() and dma_sync_single_for_device() functions.
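Putting that together, a rough sketch of the usual streaming-DMA pattern (dev, buf and BUF_SIZE are placeholders):

#include <linux/dma-mapping.h>

/* Map once, then sync around each transfer instead of re-mapping. */
dma_addr_t handle = dma_map_single(dev, buf, BUF_SIZE, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, handle))
    return -ENOMEM;

dma_sync_single_for_device(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);
/* ... start the device and wait for the transfer to complete ... */
dma_sync_single_for_cpu(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);
/* ... the CPU may now safely read buf ... */

dma_unmap_single(dev, handle, BUF_SIZE, DMA_FROM_DEVICE);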
On a side note, for my case, using Streaming DMA resulted in ~10X faster read operations compared to Coherent DMA. However, the user code implementation gets a little more complicated because you need to ensure that the memory is not accessed by the cpu while dma is mapped to the device (or that the sync operations are called before/after cpu access).
I am confused by some of the attributes of the STM32H7 MPU.
I've read several documents: the STM32H7 reference and programming manuals, the STMicro application note on the MPU, etc...
I've understood that shareable is exactly equivalent to non-cacheable (at least on a single core STM32H7). Is it correct ?
I need to define an MPU region for a QSPI Flash memory. A document from Microchip (reference TB3179) indicates that the QSPI memory should be configured as Strongly Ordered. I don't really understand why?
Question: I've understood that shareable is exactly equivalent to non-cacheable (at least on a single core STM32H7). Is it correct?
Here's an ST guide to MPU configuration:
https://www.st.com/content/st_com/en/support/learning/stm32-education/stm32-moocs/STM32_MPU_tips.html
If some area is Cacheable and Shareable, only instruction cache is used in STM32F7/H7
As STM32 [F7 and H7] microcontrollers don't contain any hardware feature for keeping data coherent, setting a region as Shareable means that data cache is not used in the region. If the region is not shareable, data cache can be used, but data coherency between bus masters need to be ensured by software.
Shareable on STM32H7 seems to be implicitly synonymous with non-cached access when INSTRUCTION_ACCESS_DISABLED (Execute Never, code execution disabled).
Furthermore,
https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5468/shareability-memory-attribute
The sharability attribute tells the processor it must do whatever is necessary to allow that data to be shared. What that really means depends on the features of a particular processor.
On a processor with multi-CPU hardware cache coherency, the shareability attribute is a signal to engage the cache coherency logic. For example, A57 can maintain cache-coherency of shareable data within the cluster and between clusters if connected via a coherent interconnect.
On a processor without hardware cache coherency, such as Cortex-A8, the only way to share the data is to push it out of the cache as you guessed. On A8 shareable, cacheable memory ends up being treated as un-cached.
Someone, please correct me if I'm wrong - it's so hard to come by definitive and concise statements on the topic.
Question: I need to define an MPU region for a QSPI Flash memory.
QSPI memory should be configured as Strongly Ordered. I don't really understand why?
The MPU guide above claims at least two points: prevent speculative access and prevent writes from being fragmented (e.g. interrupted by reading operations).
Speculative memory read may cause high latency or even system error when performed on external memories like SDRAM, or Quad-SPI. External memories even don't need to be connected to the microcontroller, but its memory range is accessible by speculative read because by default, its memory region is set as Normal.
Speculative access is never made to Strongly Ordered and Device memory areas.
Strongly Ordered memory type is used in memories which need to have each write be a single transaction
For Strongly Ordered memory region CPU waits for the end of memory access instruction.
Finally, I suspect that alignment can be a requirement from the memory side which is adequately represented by a memory type that enforces aligned read/write access.
https://developer.arm.com/documentation/ddi0489/d/memory-system/axim-interface/memory-system-implications-for-axi-accesses
However, Device and Strongly-ordered memory are always Non-cacheable. Also, any unaligned access to Device or Strongly-ordered memory generates alignment UsageFault and therefore does not cause any AXI transfer. This means that the access examples given in this chapter never show unaligned accesses to Device or Strongly-ordered memory.
UsageFault: Without explicit configuration, UsageFault defaults to calling the HardFault handler. Differentiated error handling needs to be enabled in the SCB System Handler Control and State Register first:
SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk // will also be set by HAL_MPU_Enable()
| SCB_SHCSR_BUSFAULTENA_Msk
| SCB_SHCSR_USGFAULTENA_Msk;
UsageFault handlers can evaluate UsageFault status register (UFSR) described in https://www.keil.com/appnotes/files/apnt209.pdf.
printf("UFSR : 0x%4x\n", (SCB->CFSR >> 16) & 0xFFFF);
If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?
As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.
No, x86 doesn't let you do this. You can force evict with clflushopt, or (on upcoming CPUs) write a line back without evicting it with clwb, but you can't pin a line in cache or disable coherency.
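For reference, those two instructions are exposed as intrinsics (compile with -mclflushopt / -mclwb; this only forces lines out, it cannot pin them):

#include <immintrin.h>

void flush_line(void *p)
{
    _mm_clflushopt(p);   /* evict the line containing p from the cache hierarchy */
    _mm_sfence();        /* order the flush against later stores */
}

void writeback_line(void *p)
{
    _mm_clwb(p);         /* write the line back without (necessarily) evicting it */
    _mm_sfence();
}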
You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data. See Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.
I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).
So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).
As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.
You normally don't need this.
If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.
All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
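If you go the dummy-read route, it can be as simple as this (purely a heuristic to influence LRU, not a guarantee):

/* Touch the line occasionally from a hot path so LRU keeps it resident. */
static inline void keep_warm(const volatile char *line)
{
    (void)*line;   /* volatile read prevents the compiler optimizing it away */
}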
However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.
I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
See also Which cache mapping technique is used in intel core i7 processor?
@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases. See Using ymm registers as a "memory-like" storage location. You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (See How to inform GCC to not use a particular register.)
Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.
There is no direct way to achieve that on Intel and AMD x86 processors, but you can get pretty close with some effort. First, you said you're worried that the cache line might get evicted from the L1 because some other core might access it. This can only happen in the following situations:
The line is shared, and therefore, it can be accessed by multiple agents in the system concurrently. If another agent attempts to read the line, its state will change from Modified or Exclusive to Shared. That is, it will stay in the L1. If, on the other hand, another agent attempts to write to the line, it has to be invalidated from the L1.
The line can be private or shared, but the thread got rescheduled by the OS to run on another core. Similar to the previous case, if it attempts to read the line, its state will change from Modified or Exclusive to Shared in both L1 caches. If it attempts to write to the line, it has to be invalidated from the L1 of the previous core on which it was running.
There are other reasons why the line may get evicted from the L1 as I will discuss shortly.
If the line is shared, then you cannot disable coherency. What you can do, however, is make a private copy of it, which effectively does disable coherency. If doing that may lead to faulty behavior, then the only thing you can do is to set the affinity of all threads that share the line to run on the same physical core on a hyperthreaded (SMT) Intel processor. Since the L1 is shared between the logical cores, the line will not get evicted due to sharing, but it can still get evicted due to other reasons.
Setting the affinity of a thread does not guarantee though that other threads cannot get scheduled to run on the same core. To reduce the probability of scheduling other threads (that don't access the line) on the same core or rescheduling the thread to run on other physical cores, you can increase the priority of the thread (or all the threads that share the line).
Intel processors are mostly 2-way hyperthreaded, so you can only run two threads that share the line at a time. So if you play with the affinity and priority of the threads, performance can change in interesting ways. You'll have to measure it. Recent AMD processors also support SMT.
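As a concrete illustration, pinning the calling thread to one logical CPU on Linux could look like this (the CPU id is a placeholder; which logical CPUs are SMT siblings is machine-specific, see /sys/devices/system/cpu/cpuN/topology/thread_siblings_list):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU; call once per thread
 * with the ids of two sibling hyperthreads to keep both on one core. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}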
If the line is private (only one thread can access it), a thread running on a sibling logical core in an Intel processor may cause the line to be evicted because the L1 is competitively shared, depending on its memory access behavior. I will discuss how this can be dealt with shortly.
Another issue is interrupts and exceptions. On Linux and maybe other OSes, you can configure which cores should handle which interrupts. I think it's OK to map all interrupts to all other cores, except the periodic timer interrupt whose interrupt handler's behavior is OS-dependent and it may not be safe to play with it. Depending on how much effort you want to spend on this, you can perform carefully designed experiments to determine the impact of the timer interrupt handler on the L1D cache contents. Also you should avoid exceptions.
I can think of two reasons why a line might get invalidated:
A (potentially speculative) RFO with intent for modification from another core.
The line was chosen to be evicted to make space for another line. This depends on the design of the cache hierarchy:
The L1 cache placement policy.
The L1 cache replacement policy.
Whether lower level caches are inclusive or not.
The replacement policy is commonly not configurable, so you should strive to avoid conflict L1 misses, which depends on the placement policy, which depends on the microarchitecture. On Intel processors, the L1D is typically both virtually indexed and physically indexed because the bits used for the index don't require translation. Since you know the virtual addresses of all memory accesses, you can determine which lines would be allocated from which cache set. You need to make sure that the number of lines mapped to the same set (including the line you don't want to be evicted) does not exceed the associativity of the cache. Otherwise, you'd be at the mercy of the replacement policy. Note also that an L1D prefetcher can also change the contents of the cache. You can disable it on Intel processors and measure its impact in both cases. I cannot think of an easy way to deal with inclusive lower-level caches.
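To illustrate the placement arithmetic: assuming a typical Intel L1D of 32 KiB, 8-way, with 64-byte lines (so 64 sets), the set index comes straight from address bits [11:6], and lines agreeing in those bits compete for the same 8 ways:

#include <stdint.h>

/* Assumption: 32 KiB, 8-way L1D, 64-byte lines => 64 sets. */
static inline unsigned l1d_set_index(uintptr_t addr)
{
    return (unsigned)((addr >> 6) & 0x3F);   /* bits [11:6] select the set */
}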
I think the idea of "pinning" a line in the cache is interesting and can be useful. It's a hybrid between caches and scratch pad memories. The line would be like a temporary register mapped to the virtual address space.
The main issue here is that you want to both read from and write to the line, while still keeping it in the cache. This sort of behavior is currently not supported.
Recently I have been experimenting with MMU initialization code on the Raspberry Pi 2 and encountered strange behavior. What I am trying to do is establish a trivial section mapping.
I used this code as a reference base. Although a brief review showed that this code is written for the BCM2835, I still don't have anything better than that.
The problem I encountered is a dead end after cache flushing. Here is the full sample of the start_mmu function:
.globl start_mmu
start_mmu:
mov r2,#0
mcr p15,0,r2,c7,c7,0 ;# invalidate caches
mcr p15,0,r2,c8,c7,0 ;# invalidate tlb
mcr p15,0,r2,c7,c10,4 ;# DSB ??
mvn r2,#0
bic r2,#0xC
mcr p15,0,r2,c3,c0,0 ;# domain
mcr p15,0,r0,c2,c0,0 ;# tlb base
mcr p15,0,r0,c2,c0,1 ;# tlb base
mrc p15,0,r2,c1,c0,0
orr r2,r2,r1
mcr p15,0,r2,c1,c0,0
In other words, I get a dead end on the cache-invalidating line:
mcr p15,0,r2,c7,c7,0 ;# invalidate caches
By dead end I mean that I can't print anything after this line is executed. It seems that I am falling into some exception at that moment.
If I omit this cache-invalidation line, I can go forward, but it seems that the MMU mappings aren't established correctly after my setup (but this is another question).
What I want to know is:
1.) Why do we need to invalidate the caches and TLB before MMU startup?
2.) What could be the reason for the dead-end problem?
Why do we need to invalidate the caches and TLB before MMU startup?
Because they could contain uninitialised junk (or just stale entries after a reset). As soon as you turn the MMU on, addresses for instruction/data accesses may be looked up in the TLBs, and if any of that junk happens to look sufficiently like a valid entry matching the relevant virtual address then you're going to have a bad time. Similarly for the instructions/data themselves once the caches are enabled.
What could be the reason for the dead-end problem?
You're executing an invalid instruction.
If you want to write bare-metal code, it pays to understand the metal you're running on - the Raspberry Pi 2 has Cortex-A7 cores, which are not the same as the ARM1176 core in the other models, and thus behave differently. Specifically in this case, the CP15 c7, 0, c7 system register space where unified cache operations lived under the ARMv6 architecture is no longer allocated in ARMv7, thus attempting to access it leads to unpredictable behaviour. You need to invalidate your I-cache and D-cache separately. I'd recommend at very least looking at the Cortex-A7 TRM, and ideally the Architecture Reference Manual. Also for real-world examples, there's always Linux and friends. Yes it's an awful lot to take in, but hey, this is a full-blown multi-core mobile-class application processor, not some microcontroller ;)
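As a hedged sketch (not drop-in replacement code), the ARMv7 equivalents of that single ARMv6 operation look something like the following; the D-cache must additionally be invalidated by set/way in a loop, for which v7_invalidate_l1 in the kernel's cache-v7.S is the canonical example:

static inline void v7_invalidate_icache_bp_tlb(void)
{
    const int zero = 0;
    __asm volatile ("mcr p15, 0, %0, c7, c5, 0" :: "r" (zero));  /* ICIALLU: invalidate I-cache */
    __asm volatile ("mcr p15, 0, %0, c7, c5, 6" :: "r" (zero));  /* BPIALL: invalidate branch predictor */
    __asm volatile ("mcr p15, 0, %0, c8, c7, 0" :: "r" (zero));  /* TLBIALL: invalidate unified TLB */
    __asm volatile ("dsb\n\tisb" ::: "memory");
}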
Now, the first priority should be to set up some exception vector handlers that will give some debug output when things go wrong, because a lot more things are bound to go wrong from here on in...
Question 1:
Where exactly does the internal register and internal cache exist? I understand that when a program is loaded into main memory it contains a text section, a stack, a heap and so on. However is the register located in a fixed area of main memory, or is it physically on the CPU and doesn't reside in main memory? Does this apply to the cache as well?
Question 2:
How exactly does a device controller use direct memory access without using the CPU to schedule/move data between the local buffer and main memory?
Basic answer:
The CPU registers are directly on the CPU. The L1, L2, and L3 caches are often on-chip; however, they may be shared between multiple cores or processors, so they're not always "physically on the CPU." However, they're never part of main memory either. The general principle is that the closer memory is to the CPU, the faster and more expensive (and thus smaller) it is. Every item in the cache has a particular main memory address associated with it (however, the same slot can be associated with different addresses at different times). However, there is no direct association between registers and main memory. That is why if you use the register keyword in C (not that it's often necessary, since the compiler is usually a better optimizer), you can not use the & operator.
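For example, a small illustration of that last point:

void example(void)
{
    register int r = 0;   /* the compiler may keep r purely in a CPU register */
    int *p = &r;          /* error: cannot take the address of a register variable */
    (void)p;
}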
The DMA controller executes the transfer directly. The CPU watches the bus so it knows when changes are made "behind its back", which invalidate its cache(s).
Even though the CPU is the central processing unit, it's not the sole "mover and shaker". Devices live on buses, along with CPUs, and also RAM. Modern buses allow devices to communicate with RAM without involving the CPU. Some devices are programmed simply by making changes to pieces of RAM which devices poll. Device drivers may poll pieces of RAM that a device is writing into, but usually a CPU receives an interrupt from the device telling it that there's something in a piece of RAM that's ready to read.
So, in answer to your question 2, the CPU isn't involved in memory transfers across the bus, except inasmuch as cache coherence messages about the invalidation of cache lines are involved. Bear in mind that the scenarios are tricky. The CPU might have modified byte 1 on a cache line when a device decides to modify byte 40. Getting that dirty cache line out of the CPU needs to happen before the device can modify data, but on x86 anyway, that activity is initiated by the bus, not the CPU.