Suppose that I have an embedded project (with an ARM Cortex-M, if it makes a difference) where parts of the code are critical and need to run fast and with timing that is as deterministic as possible.
Would it be possible to sacrifice part of the L1 cache and reserve it for the critical code/data? I could then load the critical code/data and always run/access them at L1 cache speeds.
OK, I think the answer is "technically speaking, no". Memory allocated as cache is used by the cache controller to do what it is there for, and that is caching.
So hopefully the chip vendor has provided ways to run code from the fastest memory available. If the chip has TCM, then loading your critical code there should be fine and should run about as fast as it would when hit in the L1 cache. If the chip only provides flash and RAM, then loading the critical code into RAM should also be much faster. In the latter case, the cache controller, if there is one, may be configured to cache code from that same RAM anyway.
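As a hedged sketch of that TCM/RAM approach (the section names ".itcm_text" and ".dtcm_data", and the matching linker-script and startup-code support, are assumptions, not something every vendor provides), placing the critical routine and its hot data with GCC section attributes could look roughly like this:

    /* Sketch only: assumes a GCC toolchain, a linker script that maps .itcm_text
     * into ITCM and .dtcm_data into DTCM, and startup code that copies .itcm_text
     * from flash into ITCM before main() runs. */
    #include <stdint.h>

    /* Critical routine fetched from ITCM: no flash wait states, no cache misses. */
    __attribute__((section(".itcm_text")))
    void critical_control_step(volatile uint32_t *output)
    {
        *output = 42u;   /* placeholder for the real time-critical work */
    }

    /* Hot data kept in DTCM so accesses are fast and never evicted. */
    __attribute__((section(".dtcm_data")))
    static volatile uint32_t sample_buffer[256];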
Yes, it is possible:
TB3186
"How to Achieve Deterministic Code Performance Using aCortex™-M Cache Controller"
http://ww1.microchip.com/downloads/en/DeviceDoc/How-to-Achieve-Deterministic-Code-Performance-using-CortexM-Cache-Controller-DS90003186A.pdf
...
With CMCC, a part of the cache can be used as TCM for deterministic code performance by loading the critical code in a WAY and locking it. When a particular WAY is locked, the CMCC does not use the locked WAY for routine cache transactions. The locked cache WAY with the loaded critical code acts as an always-getting cache hit condition.
I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use the __DSB() instruction, which should write the cached data out to RAM. After enabling the D-cache again, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside the low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation, I found that an SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
I think your ARM system has multiple coherency domains (microcontroller vs. DSP)? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush the cache, what they presumably mean is that it orders memory/cache operations, including explicit cache-maintenance operations, which are still necessary.
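For what it's worth, the shape of the fix, cleaning just the D-cache lines that cover the outgoing frame before handing it to the MAC, could look roughly like this. This is a sketch only: heth, TxConfig and the HAL_ETH_Transmit signature are the usual STM32CubeMX-generated names for the H7 HAL and may differ in your project.

    /* Sketch: Cortex-M7, 32-byte D-cache lines, CMSIS-Core cache maintenance. */
    #include <stddef.h>
    #include "stm32h7xx_hal.h"         /* assumed project include */

    extern ETH_HandleTypeDef  heth;    /* CubeMX-generated handle (assumed name) */
    extern ETH_TxPacketConfig TxConfig;

    HAL_StatusTypeDef transmit_coherent(const void *frame, size_t len)
    {
        /* Write back only the D-cache lines covering the frame so the Ethernet
         * DMA reads up-to-date data from SRAM, then order that before the send. */
        uint32_t start = (uint32_t)frame & ~31u;
        uint32_t end   = ((uint32_t)frame + len + 31u) & ~31u;
        SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)(end - start));
        __DSB();

        return HAL_ETH_Transmit(&heth, &TxConfig, HAL_MAX_DELAY);
    }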
I'd put my money on performance.
Flushing the cache means writing data from the cache back to memory, and memory access is slow.
The L1 cache size (assuming an ARM Cortex-A9) is 32 KB. You don't want to move a whole 32 KB from cache into memory for no reason. There might also be an L2 cache, which is easily 512 KB-1 MB (could be even more); you really don't want to flush the whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the caches themselves. There is simply no justification for doing that implicitly.
I would like to know in what ways using a cache buffer (e.g. a TLB) to cache frequently used pages would not be advantageous, or could even be potentially catastrophic.
I searched a bit but could not comprehend this:
"When that page is shared with another process running on a different core of the machine. For example, in Intel 3rd generation core architecture (Ivybridge) L1 and L2 cores are private for a core and L3 is shared. So, a shared page must not go above L3 cache or otherwise explicit coherence mechanism must be done by the programmer "
I don't know if you are asking about x86 CPU caches or generally for any architecture.
As far as I know (and I didn't find any source saying otherwise), on x86 the CPU hardware always gives you cache-coherent shared memory. So as an application or OS programmer you don't need to care about this. Also, there is no instruction which would allow you to prevent the CPU from caching your data in the L1/L2 or L3 cache. This wouldn't make much sense, as for every write the CPUs must keep the shared cache coherent, and they wouldn't know about your single non-cacheable page. There do seem to be ways to prefetch some data into the cache if needed, though.
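As a small illustration of that last point, GCC and Clang expose a generic prefetch builtin (on x86 it lowers to the PREFETCH instruction family); the loop and the prefetch distance of 16 elements below are made up for the example:

    #include <stddef.h>

    long sum_with_prefetch(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
            s += a[i];
        }
        return s;
    }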
For other (theoretical and future) architectures, it is possible to present the cache (and hence the whole memory) as non-coherent; then the OS must be written so that it ensures memory coherence itself. For example, this paper describes an OS design for non-cache-coherent systems (like Intel's experimental Single-chip Cloud Computer).
For writing cache-friendly applications, there are many good questions and answers here on SO as well, a few of them mentioning the 114-page PDF from Red Hat: What Every Programmer Should Know About Memory. There is also a Wikipedia page on cache coherence.
My question is about the CPU and its instructions. I know they have to be stored somewhere, so I would like to know where they are stored. I suppose in cache...
In cache, and in memory (memory holds all of the program's instructions, including those that may not be needed anytime soon).
The exact cache depends on the CPU. And cache is really just a general term used to describe any "memory" on the CPU chip, as opposed to on the separate RAM chip.
A modern CPU often has multiple caches; the one used for storing instructions is called the "instruction cache".
Information on caches in general can be found at Wikipedia - http://en.wikipedia.org/wiki/CPU_cache.
Some of the latest CPUs have something like a loop cache (for decoded instructions); when your critical loop is 64 bytes or smaller, that is the fastest place for it. Then there are outer caches such as the L1 instruction cache and the L2/L3 caches. The outermost home of the instructions is RAM. Of course, they are all loaded from ROM, an HDD, a modem, a floppy disc, a CD-ROM, ... Those in turn are written in programming languages, using a keyboard, to pass on the instructions/ideas in your mind. ----> The brain is the source; cache and CPU registers are the last stop.
Instructions are data, just as data is data. Everything is binary, everything is stored in RAM, and everything is loaded from memory into the CPU (and cached on the way). The CPU knows when it is loading instructions rather than data, because the CPU knows what it is doing (e.g. "I'm running this instruction now, I need to run this one next, I need to load it... ah, I'm loading an instruction").
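A small way to see this for yourself on a typical desktop platform (note that the cast from a function pointer to a data pointer is a common extension, not strictly portable C): dump the first bytes of a compiled function, which are just machine code sitting in memory like any other bytes.

    #include <stdio.h>

    static int add(int a, int b) { return a + b; }

    int main(void)
    {
        const unsigned char *p = (const unsigned char *)(void *)&add;
        for (int i = 0; i < 16; i++)
            printf("%02x ", p[i]);   /* machine-code bytes of add() */
        putchar('\n');
        return 0;
    }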
Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact that data (and code) exhibit locality.
So an embedded system which does not exhibit locality will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system accesses memory with short jumps, it will stay for a long time within the same 1 kB area of memory, which can be cached successfully.
If it frequently jumps between distant places inside this 1 MB, then there is no locality and the cache will be used poorly.
Also note that, depending on the architecture, you can have separate caches for data and code, or a single unified one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that fits in the cache, then you're using the cache to full advantage.
If your system is something like a database that fetches random data from any memory range, then the cache cannot be used to its full advantage (because the application does not exhibit locality of data/code).
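To make the locality point concrete, here is a toy sketch (the array size, the LCG constants and everything else are made up): both loops touch the same number of elements, but the second one defeats spatial locality by jumping pseudo-randomly across the array.

    #include <stdint.h>

    #define N (1u << 16)            /* 256 KB of data, far bigger than a small cache */
    static uint32_t table[N];

    uint32_t sum_sequential(void)   /* good locality: every fetched cache line is fully used */
    {
        uint32_t s = 0;
        for (uint32_t i = 0; i < N; i++)
            s += table[i];
        return s;
    }

    uint32_t sum_scattered(void)    /* poor locality: pseudo-random jumps across the array */
    {
        uint32_t s = 0, idx = 1;
        for (uint32_t i = 0; i < N; i++) {
            idx = idx * 1664525u + 1013904223u;   /* LCG used only to scatter the indices */
            s += table[idx & (N - 1)];
        }
        return s;
    }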
Another, somewhat unusual example:
Sometimes, if you are building a safety-critical or mission-critical system, you will want the system to be highly predictable. Caches make your code's execution timing very unpredictable, because you can't tell whether a particular memory location is cached or not, so you don't know how long it will take to access it. Disabling the cache lets you judge your program's performance more precisely and calculate the worst-case execution time. That is why it is common to disable caches in such systems.
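As a minimal sketch of that idea, assuming a Cortex-M7 with the CMSIS-Core cache functions (in a real project you would include your device header, which provides the part-specific definitions CMSIS needs):

    #include "core_cm7.h"      /* normally pulled in via the device header */

    void run_with_caches_disabled(void)
    {
        SCB_DisableICache();   /* also invalidates the I-cache */
        SCB_CleanDCache();     /* flush dirty lines to memory, as the docs require */
        SCB_DisableDCache();

        /* ... timing-critical code with uniform (slow but predictable) memory latency ... */
    }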
I do not know what your background is, but I suggest reading about what the "volatile" keyword does in the C language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a whole "cache line": a single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where, on average, you use only a small percentage of a cache line before it is evicted to bring in another one will make your performance with the cache on go down.
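As a toy sketch of both effects (the line size and way size below are assumptions, not values for any particular part):

    #include <stdint.h>

    #define LINE_SIZE 32u              /* assumed cache-line size in bytes */
    #define WAY_SIZE  (4u * 1024u)     /* assumed size of one cache way */

    /* Uses one byte out of every fetched line: most of each fill is wasted. */
    uint32_t touch_one_byte_per_line(volatile const uint8_t *buf, uint32_t len)
    {
        uint32_t s = 0;
        for (uint32_t i = 0; i < len; i += LINE_SIZE)
            s += buf[i];
        return s;
    }

    /* Strides by the way size, so every access maps to the same set and the
     * lines keep evicting each other (thrashing). */
    uint32_t thrash_one_set(volatile const uint8_t *buf, uint32_t len)
    {
        uint32_t s = 0;
        for (uint32_t i = 0; i < len; i += WAY_SIZE)
            s += buf[i];
        return s;
    }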
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. "Not ... to their full advantage" is a much simpler question: in what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word "full" in "full advantage", IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
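This is also where the earlier hint about volatile fits in. A minimal sketch of how a memory-mapped register is usually handled (the address, offsets and status bit below are made up): the pointer is volatile so the compiler performs every access, and the address region additionally has to be non-cacheable device memory (on a Cortex-M the standard peripheral region already is; otherwise the MPU or page tables must mark it so), so no cache sits between the CPU and the peripheral.

    #include <stdint.h>

    #define UART_BASE   0x40011000u                        /* illustrative address */
    #define UART_SR     (*(volatile uint32_t *)(UART_BASE + 0x00))
    #define UART_DR     (*(volatile uint32_t *)(UART_BASE + 0x04))
    #define UART_SR_TXE (1u << 7)                          /* illustrative bit */

    void uart_putc(char c)
    {
        while ((UART_SR & UART_SR_TXE) == 0)   /* each read really queries the peripheral */
            ;
        UART_DR = (uint32_t)c;                 /* each write really reaches the peripheral */
    }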
In addition to the almost complete answer by Halst, I would like to mention one additional case where caches may be far from an advantage. In a multi-core SoC, where every core of course has its own cache(s), caches can be very ineffective depending on how the program uses those cores. This may happen if, for example, due to incorrect design or program specifics (e.g. multi-core communication), some data block in RAM is used concurrently by two or more cores.
I have a simple multi-threaded app for my multi-core system. This app has a parallel region in which no threads write to a given memory address, but some may read simultaneously.
Will there still be some type of overhead or performance hit associated with several threads accessing the same memory, even though no locking is used? If so, why? How big an impact can it have, and what can be done about it?
This can depend on the specific cache synchronization protocol in use, but most modern CPUs support having the same cache line shared in multiple processor caches, provided there is no write activity to the cache line. That said, make sure you align your allocations to the cache line size; if you don't, it's possible that data that's being written to could share the same cache line as your read-only data, resulting in a performance hit when the dirtied cache line is flushed on other processors (false sharing).
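A small sketch of that alignment advice, in C11 (the 64-byte line size is an assumption; check it for your CPU): keeping the read-only data and the frequently written counter on separate cache lines means writers never dirty the line the readers are sharing.

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed line size */

    struct shared_state {
        alignas(CACHE_LINE) uint64_t read_only_table[8];  /* read by all threads, stays clean */
        alignas(CACHE_LINE) uint64_t hot_counter;         /* written by one thread, on its own line */
    };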
I would say there wouldn't be. However, the problem arises when you have multiple writers to the same references.