Caching is central to efficiency.
I know that caching usually happens automatically.
However, I'd like to control cache usage myself, because I think that I can do better than some heuristics that don't know the exact program.
Therefore I would need assembly instructions that move data directly to or from cache memory cells, for example:
movL1 address content
I know that there are some instructions that give the "caching system" hints, but I'm not sure that's enough: the hints could be ignored, or they might not be able to express everything that such a move-to/from-cache instruction could.
Are there any assemblers that allow for complete cache control?
Side note: why I'd like to improve caching.
Consider a hypothetical CPU with 1 register and a cache containing 2 cells.
Consider the following two programs
(where x, y, z, a are memory cells):
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move z to x"
"move y to x"
"END"
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move a to x"
"move y to x"
"END"
In the first case, you'd use the register and the cache for x, y, z (a is only written once).
In the second case, you'd use the register and the cache for a, x, y (z is only written once).
If the CPU does the caching, it simply can't decide ahead of time which of the two cases above it's facing.
It has to decide for each of the memory cells x, y, z whether its contents should be cached before it knows whether the program being executed is no. 1 or no. 2, because both programs start out the same.
The programmer on the other hand knows ahead of time which memory cells are reused, and when they are reused.
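To make the comparison concrete, here is a rough C sketch that counts cache misses for the two programs on a 2-cell cache, once with LRU replacement (no knowledge of the future) and once with a policy that looks ahead in the access trace, the way a programmer could. The trace strings are an assumption: each "move" is modelled as one access to its source (if it is a memory cell) followed by one access to its destination, and the single register is ignored. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CELLS 2  /* cache capacity of the hypothetical CPU */

/* Misses under LRU replacement, which knows nothing about future accesses. */
int misses_lru(const char *trace)
{
    assert(trace != NULL);
    char cell[CELLS];              /* cell[0] is the most recently used */
    int used = 0, misses = 0;
    for (size_t i = 0; trace[i] != '\0'; i++) {
        int pos = -1;
        for (int j = 0; j < used; j++)
            if (cell[j] == trace[i]) pos = j;
        if (pos < 0) {             /* miss: take a free slot or the LRU slot */
            misses++;
            if (used < CELLS) used++;
            pos = used - 1;
        }
        for (int j = pos; j > 0; j--)   /* move the accessed cell to the front */
            cell[j] = cell[j - 1];
        cell[0] = trace[i];
    }
    return misses;
}

/* Index of the next access to `name` at or after `from`; trace length if none. */
static size_t next_use(const char *trace, size_t from, char name)
{
    size_t i = from;
    while (trace[i] != '\0' && trace[i] != name) i++;
    return i;
}

/* Misses when the whole trace is known in advance: evict the cell that is
   reused farthest in the future, or do not cache the new cell at all if it
   is reused even later than that. */
int misses_with_foresight(const char *trace)
{
    assert(trace != NULL);
    char cell[CELLS];
    int used = 0, misses = 0;
    for (size_t i = 0; trace[i] != '\0'; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (cell[j] == trace[i]) hit = 1;
        if (hit) continue;
        misses++;
        if (used < CELLS) { cell[used++] = trace[i]; continue; }
        int victim = 0;
        for (int j = 1; j < CELLS; j++)
            if (next_use(trace, i + 1, cell[j]) > next_use(trace, i + 1, cell[victim]))
                victim = j;
        if (next_use(trace, i + 1, trace[i]) < next_use(trace, i + 1, cell[victim]))
            cell[victim] = trace[i];
    }
    return misses;
}
```

Under these assumptions program 1 becomes the trace "xyzazxyx" and program 2 becomes "xyzaaxyx". The look-ahead policy never does worse than LRU on a given trace, but it needs exactly the information the hardware does not have.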
Peter Cordes wrote:
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
This is correct, but the exceptions are of interest....
It is common in DSP ("Digital Signal Processing") chips to provide a limited ability to partition SRAM between "cache" and "scratchpad memory" functionality. There are lots of white papers and reference guides on this topic -- an example is http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf. In this chip, there are three blocks of SRAM -- a small "Level-1 Instruction" SRAM, a small "Level-1 Data" SRAM, and a larger "Level-2" SRAM. Each of the three can be partitioned between Cache and directly-addressed memory, with the details depending on the specific chip. For example, a chip may allow no cache, 1/4 SRAM as cache, 1/2 SRAM as cache, or all SRAM as cache. (The ratios are limited so the allowed cache sizes can be indexed efficiently.)
The IBM "Cell" processor (used in the Sony PlayStation 3, released in 2006) was a multi-core chip with one ordinary general-purpose core and eight co-processor cores. The co-processor cores had a limited instruction set, with load and store instructions that could only access their private 128KiB "scratchpad" memory. In order to access main memory, the co-processors had to program a DMA engine to perform a block copy of main memory to local scratchpad memory (or vice versa). This approach provided (and required) perfect control over data motion, resulting in (a very small amount of) very high-performance software.
Some GPUs also have small on-chip SRAMs that can be configured as either an L1 cache or as explicitly controlled local memory.
All of these are considered to be "very hard" (or worse) to use, but this can be the right approach if the product requires very low cost, completely predictable performance, or very low power.
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
Of course, a normal load will definitely bring a cache line into L1d cache, at least temporarily. Nothing stops it from being evicted later, though. e.g. on x86-64: mov eax, [rdi] instead of prefetcht0 [rdi].
Before dedicated prefetch instructions existed, using a plain load as a prefetch was sometimes done (e.g. ahead of some loop-bounds calculations before entering a loop that would start looping over an array). For performance purposes, best-effort software prefetch instructions that the CPU can ignore are usually better.
A plain load has the downside of not being able to retire from the out-of-order back-end until the loaded data actually arrives. (At least I think it can't on x86 CPUs with x86's strongly ordered memory model. Weakly-ordered ISAs that allow out-of-order loads might let the load retire even if it hasn't truly completed yet.) Software prefetch instructions exist to allow prefetch as a hint without bottlenecking the CPU on waiting for the load to finish.
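As a small illustration of the hint-style prefetch described above, here is a C sketch using the GCC/Clang builtin __builtin_prefetch, which compilers typically lower to an instruction such as prefetcht0 on x86-64. The function name and the prefetch distance are assumptions for this example; for a simple sequential scan like this the hardware prefetcher usually does the job already, so treat it as a demonstration of the mechanism, not as a recommended optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How many elements ahead to prefetch. This value is a guess and would
   need tuning by measurement on the real target. */
#define PREFETCH_DISTANCE 64

/* Sum an array while hinting that data further ahead will be needed soon. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only form addresses inside the array. The prefetch is a pure hint:
           second argument 0 = read, third argument 3 = keep in all levels. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

The result is the same with or without the prefetch line; only the timing can change, and the CPU is free to ignore the hint.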
On modern x86, forced eviction of a cache line is possible. NT stores guarantee that on Pentium-M or newer, or on CPUs after Pentium-M, I forget which. Also, clflush and clflushopt exist specifically for that.
clflush is not just a hint that the CPU can drop; it guarantees correctness for non-volatile DIMMs like Optane DC PM. See the question "Why does CLFLUSH exist in x86?" for more.
Being guaranteed, not just a hint, makes it slow. You generally don't want to do this for performance. As @old_timer says, burning instructions / cycles micro-managing the cache is almost always a waste of time. Leaving things up to the hardware's pseudo-LRU replacement and HW prefetch algorithms usually provides good results in the long run. SW prefetch can help in a few cases.
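For reference, a minimal C sketch of forced eviction with the _mm_clflush intrinsic from <emmintrin.h>. The helper name flush_range and the 64-byte line size are assumptions for this example; the code only does real work when the compiler targets x86 with SSE2 and otherwise reports that nothing was flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

#define LINE 64  /* assumed cache-line size in bytes */

/* Flush every cache line overlapping [p, p + len).
   Returns the number of lines flushed (0 when built without SSE2). */
size_t flush_range(const void *p, size_t len)
{
    assert(p != NULL);
    size_t flushed = 0;
    if (len == 0)
        return 0;
#if defined(__SSE2__)
    uintptr_t start = (uintptr_t)p & ~(uintptr_t)(LINE - 1);
    uintptr_t end = (uintptr_t)p + len;
    for (uintptr_t a = start; a < end; a += LINE) {
        _mm_clflush((const void *)a);   /* write back if dirty, then invalidate */
        flushed++;
    }
    _mm_mfence();                        /* order the flushes with later accesses */
#endif
    return flushed;
}
```

The data is still there afterwards; it just has to be fetched from memory again on the next access, which is why this costs performance instead of gaining it.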
Xeon Phi can configure its MCDRAM as a large last-level cache, or as architecturally visible "local memory" that's part of physical address space. But at 6 to 16GiB, it's vastly bigger than the on-die L1/L2 caches, or the L1/L2/L3 caches of modern mainstream CPUs.
Also, x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all when no-fill-mode is activated. i.e. only cache is available, and you have to be careful not to evict anything that was cached. It's not usable for any practical purpose except early-boot.
The questions "What use is the INVD instruction?" and "Cache-as-Ram (no fill mode) Executable Code" have some details.
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough because the hints could be ignored or they maybe aren't sufficient to express anything expressable by such a move to/from cache order.
Direct access to the cache SRAMs has nothing to do with the instruction set. If you have access then you have access, and you access it however the chip/system designers implemented it. It could be as simple as an address space, or it may be some indirect, peripheral-like access where you poke at control registers and that logic accesses the item in the cache for you.
And this doesn't mean that all ARM processors can gain access to their cache in the same way (ARM is an IP company, not a chip company), but it might mean that no, you can't do this on any existing x86s. I know for a fact that on the product I am part of we can do this, because we have ECC on those SRAMs and have an access method to initialize the RAMs from software before enabling the monitor. Some of the SRAMs you can reach through normal accesses, but, for example, the ARM we are using was implemented with parity checking, not ECC, so we added ECC on the SRAM and a side-door access for init, because trying to go through the cache with normal accesses and get 100% coverage was a PITA and in the end not the right solution.
I also worked on a product where the DRAM controller's cache can be accessed directly as an on-chip RAM; it is up to software to decide whether to use it as an L2 cache or as on-chip RAM.
So it has been and can be done, though these are isolated examples. As part of screening the parts there are MBIST tests that run, but often those are driven through JTAG and not directly available to the processor, and/or the RAM isn't. Sometimes the MBIST can be started and checked by software but the RAM can't be accessed, and in some implementations the designers made it so software can touch all of it, including the tag RAM.
Which leads to this: if you think you can do a better job than the hardware and want to move stuff around, then you will likely also need access to the tag RAM so that you can trace/drive where you want the cache line, its status, etc.
Based on this comment:
Sorry, I'm a [beginner] at assembly, could you please explain this simpler? whats a CPU "mode"? What's that HBM? How to set a CPU mode? what are NDAs? – KGM
Two things: one, you can't do better than the cache, and two, you are not ready for this task.
Even with experience you generally can't do better than the cache. If you want to manipulate the cache, you use your knowledge of how it works when deciding how you write your code, where you place it in memory, and where the data you are using lives, so that the logic implementation works better for you. Burning instructions and cycles trying to reposition things at run time isn't going to help. You generally need access to the design at a level that is not available to the general public, thus an NDA (non-disclosure agreement). Even then it is extremely unlikely that you will get the info you need, and/or the gains will be minimal, may only work on one implementation and not across the whole family of products, etc.
More interesting is what you think you can do better, and how you are thinking of doing it. (Also understand that many of us here can make any cache implementation fail and run slower than if it wasn't there. Even if you create a newer, better cache, by definition it only improves performance in certain cases.)
Suppose that I have an embedded project (with an ARM Cortex-M if it makes a difference) where parts of the code are critical and need to run fast and in a deterministic time as much as possible.
Would it be possible to sacrifice part of the L1 cache and reserve it for the critical code/data? I could then load the critical code/data and always run/access them at L1 cache speeds.
Ok I think the answer is "technically speaking, no". Memory allocated as cache memory is used by the cache controller to do what it should, and that is caching.
So hopefully the chip vendor has provided ways to run code from the fastest memory available. If the chip has TCM, then loading your critical code there should be fine, and it should run as fast as it would when cached in L1 cache. If the chip provides flash and RAM, then loading critical code into RAM should also be much faster. In the latter case, the cache controller, if it exists, may be configured to use the same RAM for running cached code anyway.
Yes, it is possible:
TB3186
"How to Achieve Deterministic Code Performance Using a Cortex™-M Cache Controller"
http://ww1.microchip.com/downloads/en/DeviceDoc/How-to-Achieve-Deterministic-Code-Performance-using-CortexM-Cache-Controller-DS90003186A.pdf
...
With CMCC, a part of the cache can be used as TCM for deterministic code performance by loading the critical code in a WAY and locking it. When a particular WAY is locked, the CMCC does not use the locked WAY for routine cache transactions. The locked cache WAY with the loaded critical code acts as an always-getting cache hit condition.
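To illustrate the idea of the quoted passage (not the CMCC register interface, which is described in TB3186 itself), here is a toy C model of a set-associative cache in which one way can be loaded and locked. All names, sizes and the round-robin replacement policy are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 64
#define LINE_BYTES 16

typedef struct {
    uint32_t tag[SETS][WAYS];
    bool     valid[SETS][WAYS];
    unsigned locked_mask;         /* bit w set: way w is locked */
    unsigned next_victim[SETS];   /* round-robin replacement pointer per set */
} cache_model;

static unsigned set_of(uint32_t addr) { return (addr / LINE_BYTES) % SETS; }
static uint32_t tag_of(uint32_t addr) { return addr / LINE_BYTES / SETS; }

void cache_reset(cache_model *c) { *c = (cache_model){0}; }

/* Place the line holding addr into the chosen way, then lock that way. */
void cache_load_and_lock(cache_model *c, unsigned way, uint32_t addr)
{
    assert(way < WAYS);
    unsigned s = set_of(addr);
    c->tag[s][way] = tag_of(addr);
    c->valid[s][way] = true;
    c->locked_mask |= 1u << way;
}

/* Returns true on a hit. On a miss the line is filled into the next
   unlocked way; locked ways are never chosen as victims. */
bool cache_access(cache_model *c, uint32_t addr)
{
    unsigned s = set_of(addr);
    uint32_t t = tag_of(addr);
    for (unsigned w = 0; w < WAYS; w++)
        if (c->valid[s][w] && c->tag[s][w] == t)
            return true;
    for (unsigned tries = 0; tries < WAYS; tries++) {
        unsigned w = c->next_victim[s];
        c->next_victim[s] = (w + 1) % WAYS;
        if (!(c->locked_mask & (1u << w))) {
            c->tag[s][w] = t;
            c->valid[s][w] = true;
            break;
        }
    }
    return false;
}
```

In this model a line loaded with cache_load_and_lock keeps hitting no matter how many conflicting addresses are accessed afterwards, while the same line loaded through cache_access is evicted once enough other lines map to its set. The price is that the remaining code and data only have WAYS - 1 ways to share.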
Could someone explain what is the difference between cache memory and scratchpad memory? I'm currently learning about computer architecture.
A scratchpad is just that: a place to keep some stuff. Cache is memory you normally talk through, not talk at. Scratchpad is like a Post-it note, something you write on and keep with you. Cache is paper you send off to someone else with instructions, like a memo.
Cache can be in various places and layers (L1, L2, L3...). Both scratchpad and cache are just SRAM in some chip, with an address and data bus and read/write/etc. control signals (as are many other things in a computer, which may or may not be used as addressable RAM). During boot, before the RAM on the far side is initialized (the slower side, with the processor being the near side; typically DRAM eventually, otherwise why have a cache), it may be possible to access the cache as addressable RAM. It depends very much on the system/design, though. There may be a control register that enables it to behave as a simple RAM, or there may be a mode, or its normal mode may be such that, so long as you don't address more than the size of the RAM based on its alignment (perhaps a 32K RAM between 32K boundaries), it does not try to evict anything or generate bus cycles on the DRAM/slow/far side of the cache, allowing you to use it as RAM just like a scratchpad.
BUT, the normal use case for a cache is as an ideally invisible pathway to RAM. You don't access the cache RAM using cache addressing; you use the address space of the RAM beyond it, and the cache simply allows the processor to continue without waiting for the slow RAM.
Talking about booting again, think about the kinds of things you need to do when booting, namely bringing up the DRAM controller, which is most definitely a non-trivial thing. Having some on-chip memory gives you, if nothing else, some temporary RAM for a small stack and some variables. You can, for example, use a compiled language like C, which needs at a minimum some RAM for stack and variables. Depending on space you can put some program there too, likely running much faster than from flash. The alternative, having no RAM, likely means writing the DRAM init in assembly using only general-purpose or other registers in the processor, taking a complicated task and making it that much more difficult. Once the main system RAM is up, you may or may not choose to keep using the on-chip (scratchpad) RAM.
I would and do argue that if you want to test the dram to see if it is working then you need to not use that ram to test that ram, the test program should not run in nor use the ram under test. Having scratchpad ram on chip (or some other ram in the address space, perhaps video card ram for example) could be used for the dram test program. Unfortunately lots of folks will use the ram under test to hold the stack and program and variables and heap from the program doing the test, leaving important parts of the ram untested other than one or a small number of patterns.
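A minimal sketch of such a RAM test in C follows. The function name and the pattern constant are assumptions for this example. On a real target, base would point at the DRAM under test while this code, its stack and its variables live in the scratchpad or other on-chip RAM; on a desktop it can only exercise the logic against an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test `words` 32-bit words starting at `base`.
   Returns the index of the first failing word, or `words` if all pass. */
size_t ram_test(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);

    /* Pass 1: every word gets a value derived from its own index, so
       address-line faults show up as wrong values somewhere else. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;

    /* Pass 2: verify, then write the inverse so every bit is tested
       in both states. */
    for (size_t i = 0; i < words; i++) {
        uint32_t expect = (uint32_t)i ^ 0xA5A5A5A5u;
        if (base[i] != expect)
            return i;
        base[i] = ~expect;
    }

    /* Pass 3: verify the inverted pattern. */
    for (size_t i = 0; i < words; i++)
        if (base[i] != ~((uint32_t)i ^ 0xA5A5A5A5u))
            return i;

    return words;
}
```

Because nothing in the region is reserved for the tester itself, every word gets the full set of patterns, which is the point made above.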
I would like to know in what ways using a cache buffer (e.g. TLB) to cache frequent pages would not be advantageous, or would be potentially catastrophic.
I searched a bit but could not comprehend this:
"When that page is shared with another process running on a different core of the machine. For example, in Intel 3rd generation core architecture (Ivybridge) L1 and L2 cores are private for a core and L3 is shared. So, a shared page must not go above L3 cache or otherwise explicit coherence mechanism must be done by the programmer "
I don't know if you are asking about x86 CPU caches or generally for any architecture.
As far as I know (and I didn't find any source saying otherwise), in x86, the CPU hardware always ensures that you have a cache-coherent shared memory. So you as an application or OS programmer don't need to care about this. Also, there is no instruction which would allow you to prevent the CPU from caching your data in L1/L2 or L3 cache. This wouldn't make any sense, as for every write the CPUs must ensure the shared cache is coherent, and it wouldn't know about your single non-cacheable page. It does seem, though, that there are ways to prefetch some data into cache if needed.
For other (theoretical and future) architectures, it is possible to present the cache (and hence whole memory) as not coherent - then the OS must be written so that it will ensure memory coherence itself. For example this paper writes about OS design for non-cache-coherent systems (like Intel's experimental Single-Chip Cloud computer).
For writing applications "friendly" to caching, there are many good questions and answers here on SO, a few of them mentioning this 114-page PDF from Red Hat: What Every Programmer Should Know About Memory. There is also the Wikipedia page on cache coherence.
We have an app that requires ~1MB buffers for a hardware device to fill, therefore we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() as we need to manipulate the buffers and therefore wanted them to be cached (we flush the cache when needed). One of the manipulations is that the kernel module copies one buffer to another buffer. In timing these copies we see it takes about 2ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow, we wrote a standard userspace test app that used malloc() to create 1MB buffers and copied them. The userspace copies took about 0.5ms, which is about the correct time to move this amount of memory on the processor/memory config we are using.
Things we tried: To make sure it wasn't a different memcpy() in kernel space and user space, we wrote our own NEON-optimized copy, but it made no difference. We changed the buffer size from 100KB to 10MB, and that made no difference either. All times were taken over 10 copies and were always very consistent. The timing routine used was gettimeofday() in userspace.
The only thing we can think of is that the data cache is set up differently for kmalloc()'ed memory than for malloc()'ed memory?
We are working on an iMX6 ARM with a Linaro kernel.
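For anyone wanting to reproduce the userspace side of this measurement, a rough C sketch is below. It is not the original test app: the function name is made up, and it uses clock_gettime(CLOCK_MONOTONIC) where the question used gettimeofday(). It assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Average time in milliseconds for one memcpy() of `bytes` bytes between
   two malloc()'ed buffers, measured over `reps` copies.
   Returns -1.0 if allocation or verification fails. */
double avg_copy_ms(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;   /* vary the input between copies */
        memcpy(dst, src, bytes);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int ok = memcmp(dst, src, bytes) == 0;   /* also keeps the copy observable */
    free(src);
    free(dst);
    if (!ok)
        return -1.0;

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / reps;
}
```

Calling avg_copy_ms(1u << 20, 10) corresponds to the 1MB, 10-copy case. A kernel-side equivalent would need kernel timing facilities instead and is not shown here.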
The kmalloc() memory will be contiguous in physical space. The user space memory will definitely not be (mlock() may result in something closer to contiguous). If you have several SDRAM chips, it is possible that your memory controller allows pipelining or multiple-issue reads/writes to different chips simultaneously. It may even be faster with multiple banks. vmalloc() will not use contiguous pages. You should be able to write a test that swaps kmalloc() with vmalloc(). If something has changed with the newer ARMs and the cache is not VIVT, the difference in physical addresses could cause cache (aliasing?) effects on some processors.
I do not think that the caches are set up differently for kernel memory versus user memory, at least with 2.6.34 variants, but the memory may come from different pools. Also, for a memcpy() a large cache is not needed; you just need enough to make sure the SDRAM will burst.
Another issue is peripherals. For instance, a large graphics buffer on one chip may be stealing cycles via DMA. If you can change your machine file or device table to disable as many drivers as possible, this can be eliminated. This combined with the pipelining could account for the type of slow-down observed.
I believe this is a platform issue. If it were strictly a Linux issue, I think that one of the millions of users would have encountered it. However, you haven't given a specific version of Linux. It could be an ARM-based issue, so I tagged it as such. I think it is your platform/ARM combination, simply because otherwise others would have observed this. Can you also provide the specific machine file or device table that your design was based upon, and the Linux version?