Managing cache with memory mapped I/O - caching

I have a question regarding memory-mapped I/O.
Suppose there is a memory-mapped I/O peripheral whose value is read by the CPU. Once read, the value is stored in the cache, but the value in memory is then updated by the external I/O peripheral.
In such a case, how will the CPU determine that its cached copy is stale, and what is the workaround?

That is strongly platform dependent, and there are actually two different cases.
Case #1. Memory-mapped peripheral. This means that access to some range of physical memory addresses is routed to the peripheral device; there is no actual RAM involved. To control caching, x86, for example, has MTRRs ("memory type range registers") and the PAT ("page attribute table"). They allow the caching mode to be set on a particular range of physical memory. Under normal circumstances, a range mapped to RAM is write-back cacheable, while a range mapped to peripheral devices is uncacheable. The different caching policies are described in Intel's system programming guide, section 11.3, "Methods of Caching Available". So when you issue a read or write request to a memory-mapped peripheral, the CPU cache is bypassed and the request goes directly to the device.
Case #2. DMA. It allows peripheral devices to access RAM asynchronously. In this case, the DMA controller is no different from any CPU and participates equally in the cache coherency protocol. A write request from the peripheral is seen by the caches of other CPUs, and cache lines are either invalidated or updated with the new data. A read request is likewise seen by the caches of other CPUs, and data is returned from the cache rather than from main RAM. (This is only an example: the actual implementation is platform dependent. SoCs, for example, typically do not guarantee strong cache coherency between peripherals and the CPU.)
In both cases, the caching problem also exists at the compiler level: the compiler may cache data values in registers. That is why programming languages have means of prohibiting such optimizations, for example the volatile keyword in C.
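To illustrate that compiler-level point, here is a minimal sketch in C. The register name and address are assumptions for a hypothetical device whose register range is already configured as uncacheable (e.g. via MTRR/PAT or MPU attributes):

#include <stdint.h>

/* Hypothetical device: a status register mapped at 0x40000000,
 * in an address range already configured as uncacheable. */
#define DEV_STATUS (*(volatile uint32_t *)0x40000000u)

static void wait_until_device_ready(void)
{
    /* volatile forces a fresh read of the register on every iteration,
     * instead of letting the compiler cache the value in a CPU register. */
    while ((DEV_STATUS & 0x1u) == 0u) {
        /* spin */
    }
}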

Related

Hyper-Threading data cache context aliasing

In Intel's manual, the following section confuses me:
11.5.6.2 Shared Mode
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased,
meaning that one linear address in the cache can point to different
physical locations. The mechanism for resolving aliasing can lead to
thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the
preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology.
Since Intel uses VIPT (equivalent to PIPT) to access the cache,
how could cache aliasing happen?
Based on the Intel® 64 and IA-32 Architectures Optimization Reference Manual, November 2009 (248966-020), Section 2.6.1.3:
Most resources in a physical processor are fully shared to improve the
dynamic utilization of the resource, including caches and all the
execution units. Some shared resources which are linearly addressed,
like the DTLB, include a logical processor ID bit to distinguish
whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID
bit:
Shared mode: The L1 data cache is fully shared by two logical
processors.
Adaptive mode: In adaptive mode, memory accesses using the page
directory are mapped identically across logical processors sharing the
L1 data cache.
Aliasing is possible because the processor ID/context-ID bit (which is just a bit indicating which virtual processor the memory access came from) would be different for different threads and shared mode uses that bit. Adaptive mode simply addresses the cache as one would normally expect, only using the memory address.
Specifically how the processor ID is used when indexing the cache in shared mode appears not to be documented. (XORing with several address bits would provide dispersal of indexes such that adjacent indexes for one hardware thread would map to more separated indexes for the other thread. Selecting a different bit order for different threads is less likely since such would tend to increase delay. Dispersal reduces conflict frequency given spatial locality above cache line granularity but less than way-size granularity.)
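Purely as an illustration of that speculation (the real NetBurst hash is undocumented), a shared-mode set index might be derived roughly as follows; the line size, set count and dispersal mask are made-up values:

#include <stdint.h>

#define LINE_SHIFT 6      /* 64-byte cache lines (assumed) */
#define NUM_SETS   64u    /* assumed number of sets */

/* Hypothetical: fold the logical-processor (context-ID) bit into the index so
 * the two hardware threads map the same linear address to different sets. */
static unsigned shared_mode_set_index(uint32_t linear_addr, unsigned logical_cpu)
{
    unsigned index     = (unsigned)(linear_addr >> LINE_SHIFT) & (NUM_SETS - 1u);
    unsigned dispersal = logical_cpu ? 0x15u : 0x00u;   /* arbitrary mask */
    return (index ^ dispersal) & (NUM_SETS - 1u);
}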

STM32H7 MPU shareable memory attribute and strongly ordered memory type

I am confused by some of the attributes of the STM32H7 MPU.
I've read several documents: the STM32H7 reference and programming manuals, the STMicro application note on the MPU, etc.
I've understood that Shareable is exactly equivalent to non-cacheable (at least on a single-core STM32H7). Is that correct?
I need to define an MPU region for a QSPI Flash memory. A document from Microchip (reference TB3179) indicates that the QSPI memory should be configured as Strongly Ordered. I don't really understand why.
Question: I've understood that Shareable is exactly equivalent to non-cacheable (at least on a single-core STM32H7). Is that correct?
Here's an ST guide to MPU configuration:
https://www.st.com/content/st_com/en/support/learning/stm32-education/stm32-moocs/STM32_MPU_tips.html
If some area is Cacheable and Shareable, only instruction cache is used in STM32F7/H7
As STM32 [F7 and H7] microcontrollers don't contain any hardware
feature for keeping data coherent, setting a region as Shareable
means that data cache is not used in the region. If the region is not
shareable, data cache can be used, but data coherency between bus
masters needs to be ensured by software.
Shareable on the STM32H7 therefore seems to be implicitly synonymous with non-cached access, at least when the region is also marked INSTRUCTION_ACCESS_DISABLED (Execute Never, so the instruction cache is out of the picture as well).
Furthermore,
https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5468/shareability-memory-attribute
The shareability attribute tells the processor it must do whatever
is necessary to allow that data to be shared. What that really
means depends on the features of a particular processor.
On a processor with multi-CPU hardware cache coherency, the
shareability attribute is a signal to engage the cache coherency logic.
For example A57 can maintain cache-coherency of shareable data within
the cluster and between clusters if connected via a coherent
interconnect.
On a processor without hardware cache coherency, such as Cortex-A8, the only way to share the data is to push it out of the
cache as you guessed. On A8 shareable, cacheable memory ends up
being treated as un-cached.
Someone, please correct me if I'm wrong - it's so hard to come by definitive and concise statements on the topic.
Question: I need to define an MPU region for a QSPI Flash memory.
QSPI memory should be configured as Strongly Ordered. I don't really understand why?
The MPU guide above makes at least two points: Strongly Ordered prevents speculative accesses, and it prevents writes from being fragmented (e.g. interrupted by read operations).
Speculative memory reads may cause high latency or even a system error
when performed on external memories like SDRAM or Quad-SPI.
External memories don't even need to be connected to the microcontroller:
their memory range is still accessible by speculative reads, because by
default that memory region is set as Normal.
Speculative access is never made to Strongly Ordered and Device memory
areas.
The Strongly Ordered memory type is used for memories that need each write to be a single transaction.
For a Strongly Ordered memory region, the CPU waits for the memory access to complete before executing the next instruction.
Finally, I suspect that alignment can be a requirement from the memory side which is adequately represented by a memory type that enforces aligned read/write access.
https://developer.arm.com/documentation/ddi0489/d/memory-system/axim-interface/memory-system-implications-for-axi-accesses
However, Device and Strongly-ordered memory are always Non-cacheable.
Also, any unaligned access to Device or Strongly-ordered memory
generates alignment UsageFault and therefore does not cause any AXI
transfer. This means that the access examples given in this chapter
never show unaligned accesses to Device or Strongly-ordered memory.
UsageFault: without explicit configuration, a UsageFault escalates to the HardFault handler. Differentiated fault handling needs to be enabled in the SCB System Handler Control and State Register (SHCSR) first:
SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk   // will also be set by HAL_MPU_Enable()
            | SCB_SHCSR_BUSFAULTENA_Msk
            | SCB_SHCSR_USGFAULTENA_Msk;
A UsageFault handler can then evaluate the UsageFault status register (UFSR), described in https://www.keil.com/appnotes/files/apnt209.pdf, for example:
printf("UFSR : 0x%4x\n", (SCB->CFSR >> 16) & 0xFFFF);

How to flush an address range in L1 and L2 Cache from Linux kernel space on ARM v7?

I am writing a dummy driver to share a kernel buffer with user space on ARMv7.
I want to implement the fsync() operation for this buffer. Which APIs should I use to flush the L1 and L2 caches for a given user address range in my fsync?
There are many APIs available in asm/cacheflush.h, but I am not sure whether they flush both L1 and L2 or only L1.
Currently I am using
dmac_flush_range()
outer_flush_range()
APIs. Are they fine for the use case?
Thanks!
ARMv7 mandates that data caches behave as if physically-indexed and physically-tagged*, which means that multiple virtual addresses mapping to the same physical address are naturally coherent with each other without requiring any cache maintenance or barriers. Therefore the kernel mapping and user mapping of your buffer are already fully in sync at all times, and there's not really anything you need to do. You certainly don't have any of the VIVT cache problems of older CPUs.
That said, using those architecture-private cache APIs directly from a driver would get you roundly shouted at by kernel maintainers these days - drivers should normally only need to care about cache maintenance at all when DMA is involved, but correct use of the DMA mapping API already takes care of everything in that regard.
* they don't strictly have to be PIPT, for instance Cortex-A8's L1 which is actually non-aliasing VIPT under the hood.
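For the DMA case mentioned above, a hedged sketch of the streaming DMA mapping API as it might appear in a driver is shown below; the device pointer, buffer and length are assumed to come from the driver, and on non-coherent ARM platforms these calls perform the required L1/L2 maintenance internally:

#include <linux/dma-mapping.h>

static int hand_buffer_to_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* Map the kernel buffer for a device read; this performs any cache
     * clean the platform needs before the device sees the data. */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the device with 'handle' and start the transfer ... */

    /* Unmap when the transfer is complete; with DMA_FROM_DEVICE this is
     * where stale cache lines would be invalidated before the CPU reads. */
    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}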

DMA vs Cache difference

Probably a stupid question for most who know about DMA and caches... I just know that a cache stores data somewhere closer to where it is accessed, so you don't have to spend as much time on the I/O.
But what about DMA? Does it let you access main memory with less delay?
Could someone explain the differences, both of them, or why I'm just confused?
DMA is a hardware mechanism (a DMA controller) that can move data to/from memory without using CPU instructions.
For instance, a hardware device (let's say your PCI sound device) wants audio data for playback. You can either:
Write a word at a time via CPU mov instructions.
Configure the DMA device (a sketch of this follows below). You give it a start address, a destination, and the number of bytes to copy. The transfer then occurs while the CPU does something else instead of spoon-feeding the audio device.
DMA can be very complex (scatter gather, etc), and varies by bus type and system.
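As a sketch of that configuration step, here is an entirely hypothetical controller whose register layout and base address are made up; real controllers differ widely:

#include <stdint.h>

/* Hypothetical DMA controller register block; real layouts are device specific. */
typedef struct {
    volatile uint32_t src;    // source address
    volatile uint32_t dst;    // destination address
    volatile uint32_t count;  // number of bytes to copy
    volatile uint32_t ctrl;   // bit 0: start, bit 1: done (assumed)
} dma_regs_t;

#define DMA ((dma_regs_t *)0x40020000u)   // assumed base address

static void dma_copy(uint32_t src, uint32_t dst, uint32_t nbytes)
{
    DMA->src   = src;
    DMA->dst   = dst;
    DMA->count = nbytes;
    DMA->ctrl  = 1u;                      // kick off the transfer
    /* The CPU is free to do other work here, or take a completion
     * interrupt; this sketch simply polls the 'done' bit. */
    while ((DMA->ctrl & 2u) == 0u) { }
}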
I agree fully with the first answer, and here are some common additions...
On most DMA hardware you can also set it up to do memory-to-memory transfers - there are not always external devices involved. Also, depending on the system, you may or may not need to sync the CPU cache in software before (or after) the transfer, since the data the DMA engine transfers into/from memory may be moved without the knowledge of the CPU cache.
The benefit of doing any DMA is that the CPU(s) is/are able to do other things simultaneously.
Of course when the CPU also needs to access the memory, only one can gain access and the other must wait.
Memory-to-memory DMA is often used in embedded systems to increase performance, or it may even be vital for accessing some parts of the memory at all.
To answer the question, DMA and CPU-cache are totally different things and not comparable.
I know it's a bit late, but answering this question may help someone like me, I guess. Agreeing with the above answers, I think the question was really about the cache.
So yes, a cache does store information somewhere closer to the processor than main memory; this can include the results of earlier computations. Whenever data is found in the cache (called a cache hit), the value is used directly; when it is not found (called a cache miss), the processor goes on to fetch or compute the required value. Peripheral devices (SD cards, USB devices, etc.) can also access this data, which is why on startup we usually invalidate the cache so that its lines are clean, and flush it so that any dirty data is written back to main memory for the CPU to use, before resetting or initializing the cache.
DMA (Direct Memory Access): yes, it does let you access main memory. A better way to put it is that it lets a device transfer data to and from system memory directly, without the processor having to copy each word itself. @Ronnie and @Yann Ramin were both correct in that DMA is implemented by a hardware device (a DMA controller), so it can be used by your serial peripheral to move data into memory, and it can also be used for memory-to-memory transfers, for example between two cores.
You can read up further on DMA on Wikipedia, in particular about the modes in which DMA can access system memory. I'll explain them simply:
Burst mode: the DMA controller takes full control of the bus, and the CPU is idle during this time. Data is transferred in a burst (as a whole) without interruption.
Cycle stealing mode: data is transferred one byte (or word) at a time; the transfer is slower, but the CPU is not left idle.

2 basic computer questions

Question 1:
Where exactly do the internal registers and the internal cache exist? I understand that when a program is loaded into main memory it contains a text section, a stack, a heap and so on. However, are the registers located in a fixed area of main memory, or are they physically on the CPU and not resident in main memory? Does this apply to the cache as well?
Question 2:
How exactly does a device controller use direct memory access, without using the CPU, to schedule/move data between its local buffer and main memory?
Basic answer:
The CPU registers are directly on the CPU. The L1, L2, and L3 caches are often on-chip; however, they may be shared between multiple cores or processors, so they're not always "physically on the CPU." They are never part of main memory, though. The general principle is that the closer memory is to the CPU, the faster and more expensive (and thus smaller) it is. Every item in the cache has a particular main memory address associated with it (however, the same slot can be associated with different addresses at different times). There is no such association between registers and main memory, which is why, if you use the register keyword in C (not that it's often necessary, since the compiler is usually a better optimizer), you cannot use the & operator on that variable.
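A quick illustration of that last point in C: the compiler rejects taking the address of a register-qualified variable, precisely because it may never live in addressable memory.

void example(void)
{
    register int counter = 0;   // request that 'counter' live in a CPU register
    counter++;                  // normal use is fine
    /* int *p = &counter; */    // would not compile: "address of register variable requested"
}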
The DMA controller executes the transfer directly. The CPU watches the bus so it knows when changes are made "behind its back", which invalidate its cache(s).
Even though the CPU is the central processing unit, it's not the sole "mover and shaker". Devices live on buses, along with CPUs, and also RAM. Modern buses allow devices to communicate with RAM without involving the CPU. Some devices are programmed simply by making changes to pieces of RAM which devices poll. Device drivers may poll pieces of RAM that a device is writing into, but usually a CPU receives an interrupt from the device telling it that there's something in a piece of RAM that's ready to read.
So, in answer to your question 2, the CPU isn't involved in memory transfers across the bus, except inasmuch as cache coherence messages about the invalidation of cache lines are involved. Bear in mind that the scenarios are tricky. The CPU might have modified byte 1 on a cache line when a device decides to modify byte 40. Getting that dirty cache line out of the CPU needs to happen before the device can modify data, but on x86 anyway, that activity is initiated by the bus, not the CPU.
