Cache invalidation during MMU init on the Raspberry Pi 2

Recently I have been experimenting with MMU initialization code on the Raspberry Pi 2 and ran into some strange behavior. What I am trying to do is establish a trivial section mapping.
I used this code as a reference base. Although a brief review showed that it was written for the BCM2835, I still don't have anything better.
The problem I encountered was a dead end after cache flushing. Here is the full start_mmu function:
.globl start_mmu
start_mmu:
mov r2,#0
mcr p15,0,r2,c7,c7,0 ;# invalidate caches
mcr p15,0,r2,c8,c7,0 ;# invalidate tlb
mcr p15,0,r2,c7,c10,4 ;# DSB ??
mvn r2,#0
bic r2,#0xC
mcr p15,0,r2,c3,c0,0 ;# domain
mcr p15,0,r0,c2,c0,0 ;# tlb base
mcr p15,0,r0,c2,c0,1 ;# tlb base
mrc p15,0,r2,c1,c0,0
orr r2,r2,r1
mcr p15,0,r2,c1,c0,0
In other words, I hit the dead end on the cache-invalidating line:
mcr p15,0,r2,c7,c7,0 ;# invalidate caches
By dead end I mean that I can't print anything after this line has executed. It seems that I am falling into some exception at that moment.
If I omit this cache-invalidation line, I can go further, but it seems the MMU mapping isn't established correctly after my setup (but that is another question).
What I want to know is:
1) Why do we need to invalidate the caches and TLB before MMU startup?
2) What could be the reason for the dead-end problem?

Why do we need invalidate caches and tlb before MMU startup?
Because they could contain uninitialised junk (or just stale entries after a reset). As soon as you turn the MMU on, addresses for instruction/data accesses may be looked up in the TLBs, and if any of that junk happens to look sufficiently like a valid entry matching the relevant virtual address then you're going to have a bad time. Similarly for the instructions/data themselves once the caches are enabled.
What could be the reason of dead-end problem?
You're executing an invalid instruction.
If you want to write bare-metal code, it pays to understand the metal you're running on - the Raspberry Pi 2 has Cortex-A7 cores, which are not the same as the ARM1176 core in the other models, and thus behave differently. Specifically in this case, the CP15 c7, 0, c7 system register space where unified cache operations lived under the ARMv6 architecture is no longer allocated in ARMv7, thus attempting to access it leads to unpredictable behaviour. You need to invalidate your I-cache and D-cache separately. I'd recommend at very least looking at the Cortex-A7 TRM, and ideally the Architecture Reference Manual. Also for real-world examples, there's always Linux and friends. Yes it's an awful lot to take in, but hey, this is a full-blown multi-core mobile-class application processor, not some microcontroller ;)
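For reference, the separate operations on ARMv7 look something like the following. This is only a minimal sketch in C inline assembly, assuming a Cortex-A7 running in a privileged mode; the set/way loop needed to invalidate the D-cache (DCISW, walking CCSIDR) is deliberately omitted - see the ARMv7-A ARM or Linux's v7_invalidate_l1 for the full version.

/* Sketch: invalidate I-cache, branch predictor and TLBs on ARMv7 (Cortex-A7).
 * The D-cache still needs a separate invalidate-by-set/way loop, not shown. */
static inline void v7_invalidate_basics(void)
{
    unsigned int zero = 0;
    __asm volatile ("mcr p15, 0, %0, c7, c5, 0" :: "r" (zero));  /* ICIALLU: invalidate entire I-cache */
    __asm volatile ("mcr p15, 0, %0, c7, c5, 6" :: "r" (zero));  /* BPIALL: invalidate branch predictor */
    __asm volatile ("mcr p15, 0, %0, c8, c7, 0" :: "r" (zero));  /* TLBIALL: invalidate unified TLB */
    __asm volatile ("dsb");
    __asm volatile ("isb");
}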
Now, the first priority should be to set up some exception vector handlers that will give some debug output when things go wrong, because a lot more things are bound to go wrong from here on in...

Related

Linux - Streaming DMA - Explicit flush/invalidate

The documentation on the Streaming DMA API mentions that in order to ensure consistency, the cache needs to be flushed before dma-mapping to device, and invalidated after unmapping from device.
However, I am confused about whether the flush and invalidate need to be performed explicitly. That is, do the functions dma_map_single() & dma_sync_single_for_device() already take care of flushing the cachelines, or does the driver developer need to call some function to explicitly flush the cachelines of the DMA buffer? The same goes for dma_unmap_single() & dma_sync_single_for_cpu(): do these two functions automatically invalidate the DMA-buffer cache lines?
I skimmed through some existing drivers that use streaming DMA and I can't see any explicit calls to flush or invalidate the cachelines.
I also went through the kernel source code and it seems that the above-mentioned functions all 'invalidate' the cachelines in their architecture-specific implementations, which further adds to my confusion. E.g., in arch/arm64/mm/cache.S:
SYM_FUNC_START_PI(__dma_map_area)
add x1, x0, x1
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
SYM_FUNC_END_PI(__dma_map_area)
Can someone please clarify this? Thanks.
So, based on the comments received and some further findings, I thought I would answer this question myself for others with similar queries. The following is specific to the ARM64 architecture; other architectures may have a slightly different implementation.
When using the Streaming DMA API, one does NOT have to explicitly flush or invalidate the cachelines. The functions dma_map_single(), dma_sync_single_for_device(), dma_unmap_single() and dma_sync_single_for_cpu() take care of that for you. E.g. dma_map_single() and dma_sync_single_for_device() both end up calling the architecture-dependent function __dma_map_area:
ENTRY(__dma_map_area)
cmp w2, #DMA_FROM_DEVICE
b.eq __dma_inv_area
b __dma_clean_area
ENDPIPROC(__dma_map_area)
In this case, if the direction specified is DMA_FROM_DEVICE, then the cachelines are invalidated (because data must have come from the device to memory and the cachelines need to be invalidated so that any read from CPU will fetch the new data from memory). If direction is DMA_TO_DEVICE/BIDIRECTIONAL then a flush operation is performed (because data could have been written by the CPU and so the cached data needs to be flushed to the memory for valid data to be written to the device).
NOTE that the 'clean' in __dma_clean_area is ARM's nomenclature for cache flush.
Same goes for dma_unmap_single() & dma_sync_single_for_cpu() which end up calling __dma_unmap_area() which invalidates the cachelines if the direction specified is not DMA_TO_DEVICE.
ENTRY(__dma_unmap_area)
cmp w2, #DMA_TO_DEVICE
b.ne __dma_inv_area
ret
ENDPIPROC(__dma_unmap_area)
dma_map_single() and dma_unmap_single() are expensive operations since they also include some additional page mapping/unmapping operations so if the direction specified is to remain constant, it is better to use the dma_sync_single_for_cpu() and dma_sync_single_for_device() functions.
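To make the above concrete, here is a minimal sketch of the usual driver-side pattern (the device, buffer and length are placeholders): no explicit cache-maintenance calls appear anywhere, because the map/sync helpers perform the clean or invalidate internally.

#include <linux/dma-mapping.h>

/* Sketch: one mapping, several device-fills, no manual flush/invalidate. */
static int receive_twice(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the device to write 'len' bytes to 'handle', wait for completion ... */

    dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);    /* invalidates stale lines */
    /* ... CPU reads buf ... */

    dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE); /* hand it back for the next fill */
    /* ... second transfer, second sync_for_cpu, and so on ... */

    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
    return 0;
}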
On a side note, in my case, using streaming DMA resulted in ~10X faster read operations compared to coherent DMA. However, the user-code implementation gets a little more complicated because you need to ensure that the memory is not accessed by the CPU while the buffer is mapped to the device (or that the sync operations are called before/after CPU access).

Is it possible to “abort” when loading a register from memory rather than triggering a page fault?

I am thinking about 'Minimizing page faults (and TLB faults) while “walking” a large graph'
'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question looking at the problem from the other side, but does not have a solution.
I wish to be able to load some data from memory into a register, but have the load abort rather than triggering a page fault if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any non-standard permissions.
(Ideally, I would also like to abort on a TLB fault.)
The RTM (Restricted Transactional Memory) part of the TSX (Transactional Synchronization Extensions) feature allows exceptions to be suppressed:
Any fault or trap in a transactional region that must be exposed to software will be suppressed. Transactional execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never occurred.
[...]
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XM, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally, and require a non-transactional execution. These events are suppressed as if they had never occurred.
I've never used RTM but it should work something like this:
xbegin fallback
; Don't fault here
xend
; Somewhere else
fallback:
; Retry non-transactionally
Note that a transaction can be aborted for many reasons, see chapter 16.8.3.2 of the Intel manual volume 1.
Also note that RTM is not ubiquitous.
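For completeness, the same pattern is easier to experiment with through the RTM intrinsics; a sketch (the helper name and types are my own, and it needs a CPU with RTM plus -mrtm or equivalent):

#include <immintrin.h>

/* Sketch: try to read *p inside a transaction. A #PF (or any other abort
 * cause) lands us on the fallback path instead of raising a fault. */
static int try_load(const long *p, long *out)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        long v = *p;      /* a fault here aborts the transaction */
        _xend();
        *out = v;
        return 1;         /* committed */
    }
    return 0;             /* aborted: retry non-transactionally, or skip */
}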
Besides RTM I cannot think of another way to suppress a load since it must return a value or eventually signal an abort condition (which would be the same as a #PF).
There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.
(For querying whether pages of virtual memory are paged out or not, there is the Linux system call mincore(2), which produces a bitmap of present/absent for a range of pages (given as void *start / size_t length). That's maybe similar to the HW page tables, so it could probably let you avoid page faults until after you've touched memory, but it's unrelated to TLB or cache. And it maybe doesn't rule out soft page faults, only hard ones. And of course that's only the current situation: pages could be evicted between query and access.)
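A quick sketch of that query (Linux-only; the buffer must be page-aligned, and the result is only a snapshot):

#include <unistd.h>
#include <sys/mman.h>

/* Sketch: returns 1 if the page containing 'buf' is resident, 0 if touching it
 * would (hard-)fault, -1 on error (e.g. unmapped address). */
static int page_resident(void *buf)
{
    unsigned char vec;
    if (mincore(buf, sysconf(_SC_PAGESIZE), &vec) != 0)
        return -1;
    return vec & 1;
}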
Would a CPU feature like this be useful? Probably yes, for a few cases.
Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.
When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.
Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)
So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.
If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM (e.g. with Windows PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph).
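On Linux, that background-paging hint is a one-liner; a sketch (the pointer must be page-aligned, and this is only advisory):

#include <sys/mman.h>

/* Sketch: start paging in the cold side while we walk the hot side. */
static void hint_willneed(void *cold_page, size_t len)
{
    madvise(cold_page, len, MADV_WILLNEED);   /* asynchronous read-ahead hint */
}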
This will require a system call, but system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.
CPU design possibilities
Like I said, I don't think any current ISAs have this, but it would I think be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.
There are a couple possibilities that come to mind:
a queryTLB m8 instruction that writes flags (e.g. CF=1 for present) according to whether the memory operand is currently hot in TLB (including 2nd-level TLB), never doing a page walk. And a querypage m8 that will do a page walk on a TLB miss, and sets flags according to whether there's a page table entry. Putting the result in an r32 integer reg you could test/jcc on would also be an option.
a try_load r32, r/m32 instruction that does a normal load if possible, but sets flags instead of taking a page fault if a page walk finds no valid entry for the virtual address. (e.g. CF=1 for valid, CF=0 for abort with integer result = 0, like rdrand. It could make itself useful and set other flags (SF/ZF/PF) according to the value, if there is one.)
The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the IsBadXxxPtr Windows system call, except that that probably checks the logical memory map, not the hardware page tables.)
A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempting a page walk).
These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.
Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blog (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)
Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.
Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.
The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.
I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.

Prefetch instruction behavior

In order to satisfy some security property, I want to make sure that some important data is already in the cache when a statement accesses it (so there will be no cache miss). For example, for this code
...
a += 2;
...
I want to make sure that a is in the cache right before a += 2 is executed.
I was considering using the PREFETCHh instruction of x86 to achieve this:
...
__prefetch(&a); /* pseudocode */
a += 2;
...
However, I have read that inserting the prefetch instruction right before a += 2 might be too late to ensure a is in the cache when a += 2 gets executed. Is this claim true? If it is true, can I fix it by inserting a CPUID instruction after the prefetch to ensure the prefetch instruction has been executed (because the Intel manual says PREFETCHh is ordered with respect to CPUID)?
Yes, you need to prefetch with a lead-time of about the memory latency for it to be optimal. Ulrich Drepper's What Every Programmer Should Know About Memory talks a lot about prefetching.
Making this happen will be highly non-trivial for a single access. Too soon and your data might be evicted before the insn you care about. Too late and it might reduce the access time some. Tuning this will depend on compiler version/options, and on the hardware you're running on. (Higher instructions-per-cycle means you need to prefetch earlier. Higher memory latency also means you need to prefetch earlier).
Since you want to do a read-modify-write to a, you should use PREFETCHW if available. The other prefetch instructions only prefetch for reading, so the read part of the RMW could hit, but I think the store part could be delayed by MESI cache coherency getting write-ownership of the cache line.
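With GCC or Clang, the write-intent prefetch can be requested without hand-written asm; a sketch (whether it actually becomes PREFETCHW depends on the target flags, e.g. -mprfchw):

extern int a;

/* Sketch: hint that 'a' will be written soon. rw = 1 asks for a prefetch with
 * intent to write; locality = 3 keeps the line in all cache levels. */
static inline void prefetch_a_for_write(void)
{
    __builtin_prefetch(&a, 1, 3);
}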
If a isn't atomic, you can also just load a well ahead of time and use the copy in a register. The store back to the global could easily miss in this case, which could eventually stall execution, though.
You'll probably have a hard time doing something like that reliably with a compiler, instead of writing asm yourself. Any of the other ideas will also require checking the compiler output to make sure the compiler did what you're hoping.
Prefetch instructions don't necessarily prefetch anything. They're "hints", which presumably get ignored when the number of outstanding loads is near max (i.e. almost out of load buffers).
Another option is to load it (not just prefetch) and then serialize with a CPUID. (A load that throws away the result is like a prefetch). The load would have to complete before the serializing instruction, and instructions after the serializing insn can't start decoding until then. I think a prefetch can retire before the data arrives, which is normally an advantage, but not in this case where we care about one operation hitting at the expense of overall performance.
From Intel's insn ref manual (see the x86 tag wiki) entry for CPUID:
Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.
I think a sequence like this is fairly good (but still doesn't guarantee anything in a pre-emptive multi-tasking system):
add [mem], 0 # can't retire until the store completes, requiring that our core owns the cache line for writing
CPUID # later insns can't start until the prev add retires
add [mem], 2 # a += 2 Can't miss in cache unless an interrupt or the other hyper-thread evicts the cache line before this insn can execute
Here we're using add [mem], 0 as a write-prefetch which is otherwise a near no-op. (It is a non-atomic read-modify-write.) I'm not sure if PREFETCHW really will ensure the cache line is ready if you do PREFETCHW / CPUID / add [mem], 2. The insn is ordered wrt. CPUID, but the manual doesn't say that the prefetch effect is ordered.
If a is volatile, then (void)a; will get gcc or clang to emit a load insn. I assume most other compilers (MSVC?) are the same. You can probably do (void) *(volatile something*)&a to dereference a pointer to volatile and force a load from a's address.
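For example, a sketch of that cast trick for a non-volatile a:

extern int a;   /* not volatile */

/* Sketch: forces the compiler to emit a real load from a's address even though
 * the result is unused. */
static inline void force_load_a(void)
{
    (void) *(volatile int *)&a;
}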
To guarantee that a memory access will hit in cache, you'd need to be running at realtime priority pinned to a core that doesn't receive interrupts. Depending on the OS, the timer-interrupt handler is probably lightweight enough that the chance of evicting your data from cache is low enough.
If your process is descheduled between executing a prefetch insn and doing the real access, the data will probably have been evicted from at least L1 cache.
So it's unlikely you can defeat an attacker determined to do a timing attack on your code, unless it's realistic to run at realtime priority. An attacker could run many many threads of memory-intensive code...

relationship between CPUECTLR.SMPEN, caches and MMU

I'm reading an ARM document (ARM Cortex-A57 MPCore Processor) and see the following description:
You must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled, or any instruction cache or TLB maintenance operations are performed.
CPUECTLR.SMPEN is for:
Enables the processor to receive instruction cache and TLB maintenance operations broadcast from other processors in the cluster.
You must set this bit before enabling the caches and MMU, or performing any cache and TLB maintenance operations.
You must clear this bit during a processor power down sequence.
However, the real reason is still unclear to me (i.e., why we must set CPUECTLR.SMPEN to 1 before the caches and MMU are enabled). Please help me with this. Thanks.
Simply put, SMPEN essentially controls whether the core participates in coherency protocols or not.
Without it set, any TLB or cache maintenance operation a core performs will only affect that core, and it won't be aware of other cores doing the same, nor of data in other cores' private caches - on an SMP system with all the cores operating on the same regions of memory, this is generally a recipe for data corruption and disaster.
Say everyone has their MMUs and caches enabled, and core A goes to remap some page of memory - it writes zeros to the PTE, invalidates its TLB for that VA, then writes the updated PTE. Core B could also have a TLB entry for that VA: unless the TLBI is broadcast, core B won't be aware that its entry for that VA is no longer valid, and could read bogus data or worse corrupt the old physical page now that it may have been reused for something else.
OK, perhaps core B didn't have that address cached in its TLB, but goes to access it after the update, and kicks off a page table walk. Without cache coherency, this goes several ways:
Core B happens to have the page table cached in its L1; unless it can snoop core A's L1 to know that someone else now has a dirty copy of that line and its own copy is now invalid, it's going to read the stale old PTE and go wrong.
Core B doesn't have the page tables cached at L1; unless it can coherently snoop the dirty line from core A's L1, the read goes out to L2 or main memory, hits the stale old PTE and goes wrong.
Core B doesn't have the page tables cached at L1, but core A's first write has already propagated out to L2 or further; unless core B's read can snoop the second write from core A's L1, it reads the intermediate invalid PTE from L2 and takes a fault.
Core B doesn't have the page tables cached at L1, but both of core A's writes have already propagated out to L2 or further; core B's read hits the new PTE in L2, and everything manages to work as expected by pure chance.
Now, there are some situations in which you might not want this - in asymmetric multiprocessing, where the two cores might be doing completely unrelated things, running different operating systems, and working in separate areas of memory, there might be a small benefit from not having unnecessary coherency chit-chat going on in the background - on the rare occasions the cores might want to communicate with each other there, they would probably do so via inter-processor interrupts and a specific shared area of uncached memory. For SMP, though, you really do want the cores to know about each other and be part of the same coherency domain before they have a chance to start actually allocating cache lines and TLB entries, which is precisely why the control of all the broadcast and coherency machinery is wrapped up in a single, somewhat-vaguely-named "SMP enable" bit.
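For what it's worth, the write itself is just a couple of instructions early in the boot path; a rough AArch64 sketch (assuming EL1 or higher with access to the IMPLEMENTATION DEFINED registers - check the A57 TRM for the exact encoding and trap controls):

/* Sketch: set CPUECTLR_EL1.SMPEN (bit 6) before enabling caches/MMU.
 * CPUECTLR_EL1 is the IMPLEMENTATION DEFINED register s3_1_c15_c2_1. */
static inline void enable_smp_coherency(void)
{
    unsigned long v;
    __asm volatile ("mrs %0, s3_1_c15_c2_1" : "=r" (v));
    v |= (1UL << 6);                                     /* SMPEN */
    __asm volatile ("msr s3_1_c15_c2_1, %0" :: "r" (v));
    __asm volatile ("isb");
}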
To elaborate on actually entering and exiting coherency, when coming in you want to be sure that your whole data cache is invalid to avoid conflicting entries - If a CPU enters SMP with valid lines already in its cache for addresses owned by lines in other CPUs' coherent caches, the coherency protocol is broken and data loss/corruption ensues. Conversely, when going offline, the CPU has to guarantee its cache is clean to avoid data loss - it can prevent itself dirtying any more entries by disabling its cache/MMU, but it also has to exit coherency to prevent dirty lines being transferred in from other CPUs behind its back. Only then is it safe to perform the set/way operations necessary to clean the whole local cache before the contents are lost at powerdown.

Flush cache to DRAM

I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.
I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.
The problem I'm having is that it takes some time for the writes to show up in DRAM and I presume it's stuck in dcache. There's a bunch of flush_cache_* calls defined but none of them are exported, which is a clue to me that I'm barking up the wrong tree...
As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.
In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?
Thanks.
ARM processors typically have both an I/D cache and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid stalling the CPU waiting for a write to complete.
To be generic, you can flush the D-cache and the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.
static inline void dcache_clean(void)
{
const int zero = 0;
/* clean entire D cache -> push to external memory. */
__asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
" bne 1b\n" ::: "cc");
/* drain the write buffer */
__asm volatile ("mcr 15, 0, %0, c7, c10, 4"::"r" (zero));
}
You may need more if you have an L2 cache.
To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,
proc-arm926.S
cache-v7.S
cache-v6.S
etc
These routines are either called directly or looked up in a CPU info structure with function pointers to the appropriate routine for the detected CPU and configuration, depending on whether the kernel is special-purpose for a single CPU or multi-purpose like an Ubuntu distribution.
To answer the question specifically for your situation, we need to know the L2 cache, write-buffer and CPU architecture specifics, maybe including silicon revisions for errata. Another tactic is to avoid this completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed externally immediately. Depending on your memory access pattern, either solution is valid. You may wish to cache if the memory only needs to be synchronized at some checkpoint (vsync/hsync for video, etc.).
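A sketch of that alternative (the device pointer and size are placeholders for whatever your driver already has):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Sketch: allocate an un-cacheable/un-bufferable buffer shared with the PL, so
 * CPU writes reach DRAM without any explicit flushing. */
static void *alloc_shared(struct device *dev, size_t size, dma_addr_t *dma)
{
    return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}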
I hit the exact same problem, on zynq. Finally got L2 flushed/invalidated with:
#include <asm/outercache.h>
outer_cache.flush_range(start,size);
outer_cache.inv_range(start,size);
start is a kernel virtual space pointer. You also need to flush L1 to L2:
__cpuc_flush_dcache_area(start,size);
I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
It seems any suggestions I found on the 'net assume the device to be "inside" the L2 cache coherency, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed.
(For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)
I'm not familiar with Zynq, but you essentially have two options that really work:
either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
or mark the memory you map as device memory (or another non-cacheable type if you don't care about gather, reorder and early write acknowledge) and use a DSB after any write that should be seen (see the sketch after this answer).
If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble: when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that's where it will all end up.
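A sketch of that second option, in the spirit of the driver described in the question (shared_phys stands in for whatever physical base you reserved with memmap=; pgprot_writecombine() is the usual alternative if you only need non-cacheable rather than fully device-like behaviour):

#include <linux/fs.h>
#include <linux/mm.h>

static phys_addr_t shared_phys;   /* placeholder: base of the reserved region */

/* Sketch: expose the reserved region to userspace as a non-cacheable mapping. */
static int shared_mmap(struct file *filp, struct vm_area_struct *vma)
{
    size_t size = vma->vm_end - vma->vm_start;

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return io_remap_pfn_range(vma, vma->vm_start,
                              shared_phys >> PAGE_SHIFT,
                              size, vma->vm_page_prot);
}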
