In general, desktops have two kinds of CPU cache to speed up memory access.
1) Instruction cache -> to speed up fetching of executable instructions.
2) Data cache -> to speed up data fetches and stores.
As per my understanding, the instruction cache operates on the code segment of a program and the data cache operates on the data segment of a program. Is this right?
Is there no cache advantage for memory allocated from the heap? Is heap memory access covered by the data cache?
The instruction cache operates on the code segment of a program and the data cache operates on the data segment of a program. Is this right?
No, the CPU is unaware of segments.
The instruction cache is for all execution accesses, whether they are performed inside the code segment or in the heap as dynamically created code.
The data cache is for all other, non-execution accesses. Data can be in the data segment, the heap, or even in the code segment as constants.
As per my understanding, the instruction cache operates on the code segment of a program and the data cache operates on the data segment of a program. Is this right?
Is there no cache advantage for memory allocated from the heap? Is heap memory access covered by the data cache?
Memory is memory. The CPU can't tell the difference between the heap and the data segment.
Instruction caches usually just start with the address in the program counter and grab the next N bytes. The CPU still can't tell whether that address is in a code segment or a data segment.
When you write a program, it gets translated into machine-readable binary. When the CPU executes instructions, it fetches this binary, decodes what it means, and then executes it. Basically, this binary tells the CPU what instructions it has to execute. If this binary were stored only in main memory, then during each fetch stage the CPU would have to access main memory, which is really slow. Instead, we store some of it in a cache, closer to the CPU. Since this cache only contains binary information related to the instructions to be executed, we call it the instruction cache. Now, instructions need data to operate on. In your high-level code you might have something like
arrayA[i] = arrayB[i] + arrayC[i], which will translate into a machine instruction similar to
ADD memLocationStoredInRegisterA, memLocationStoredInRegisterB, memLocationStoredInRegisterC
This instruction is stored in the instruction cache, but the data, i.e. arrayA, arrayB and arrayC, will be stored in another portion of memory. Again, it would be wasteful to access main memory each time this instruction is executed, so we store some of this data in another cache, which we call the data cache.
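For instance, here is a minimal C sketch (purely illustrative, assuming the same three arrays): the machine code the compiler emits for the loop is what ends up in the instruction cache, while the array elements flow through the data cache.

#include <stddef.h>

#define N 1024

/* arrayA, arrayB and arrayC live in data memory and are pulled
   through the data cache as the loop touches them. */
static int arrayA[N], arrayB[N], arrayC[N];

void add_arrays(void)
{
    /* The machine code generated for this loop body (the loads,
       the ADD, the store and the branch) is what the instruction
       cache holds while the loop runs. */
    for (size_t i = 0; i < N; i++)
        arrayA[i] = arrayB[i] + arrayC[i];
}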
I'm debugging an HTTP server on STM32H725VG using LWIP and HAL drivers, all initially generated by STM32CubeMX. The problem is that in some cases data sent via HAL_ETH_Transmit have some octets replaced by 0x00, and this corrupted content successfully gets to the client.
I've checked that the data in the buffers passed as arguments into HAL_ETH_Transmit are intact both before and after the call to this function. So, apparently, the corruption occurs on transfer from the RAM to the MAC, because the checksum is calculated on the corrupted data. So I supposed that the problem may be due to interaction between cache and DMA. I've tried disabling D-cache, and then the corruption doesn't occur.
Then I thought that I should just use the __DSB() instruction, which should write the cached data into RAM. After enabling the D-cache again, I added __DSB() right before the call to HAL_ETH_Transmit (which is inside the low_level_output function generated by STM32CubeMX), and... nothing happened: the data are still corrupted.
Then, after some experimentation I found that SCB_CleanDCache() call after (or instead of) __DSB() fixes the problem.
This makes me wonder. The description of DSB instruction is as follows:
Data Synchronization Barrier acts as a special kind of memory barrier. No instruction in program order after this instruction executes until this instruction completes. This instruction completes when:
All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this instruction complete.
And the description of SCB_DisableDCache has the following note about SCB_CleanDCache:
When disabling the data cache, you must clean (SCB_CleanDCache) the entire cache to ensure that any dirty data is flushed to external memory.
Why doesn't the DSB flush the cache if it's supposed to be complete when "all explicit memory accesses" complete, which seems to include flushing of caches?
dsb ish works as a memory barrier for inter-thread memory order; it just orders the current CPU's access to coherent cache. You wouldn't expect dsb ish to flush any cache because that's not required for visibility within the same inner-shareable cache-coherency domain. Like it says in the manual you quoted, it finishes memory operations.
Cacheable memory operations on write-back cache only update cache; waiting for them to finish doesn't imply flushing the cache.
Your ARM system I think has multiple coherency domains for microcontroller vs. DSP? Does your __DSB intrinsic compile to a dsb sy instruction? Assuming that doesn't flush cache, what they mean is presumably that it orders memory / cache operations including explicit flushes, which are still necessary.
I'd put my money on performance.
Flushing cache means to write data from cache to memory. Memory access is slow.
L1 cache size (assuming ARM Cortex-A9) is 32KB. You don't want to move a whole 32KB from cache into memory for no reason. There might be an L2 cache, which is easily 512KB-1MB (could be even more). You really don't want to move the whole L2 either.
As a matter of fact, your whole DMA transfer might be smaller than the caches themselves. There is simply no justification for doing that.
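Following that line of reasoning, a common alternative to SCB_CleanDCache() is to clean only the lines that cover the transmit buffer. Here is a rough sketch, assuming the CMSIS Cortex-M7 helper SCB_CleanDCache_by_Addr() and hypothetical tx_buffer/tx_length variables standing in for whatever low_level_output actually hands to HAL_ETH_Transmit; the exact HAL call and buffer layout depend on your generated code.

#include "stm32h7xx.h"   /* pulls in the CMSIS cache-maintenance helpers */

extern uint8_t  *tx_buffer;   /* hypothetical DMA transmit buffer */
extern uint32_t  tx_length;   /* hypothetical payload length      */

void clean_tx_buffer_before_dma(void)
{
    /* Cortex-M7 D-cache lines are 32 bytes: align the start down and
       round the end up so every line covering the buffer is cleaned. */
    uint32_t start = (uint32_t)tx_buffer & ~31U;
    uint32_t end   = ((uint32_t)tx_buffer + tx_length + 31U) & ~31U;

    SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)(end - start));

    /* ...then call HAL_ETH_Transmit() as before. */
}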
Let's suppose the memory hierarchy is one CPU with L1i, L1d, L2i, L2d, L3, and DRAM.
I'm wondering what happens at the lower levels of the computer when I use a MOV/store instruction (or any other instruction that causes the CPU to transfer data to memory). I know what happens when there is just a CPU and memory, but with caches I'm a bit confused. I've searched for this, but it only yielded information about data transfer between:
registers and memory
CPU and cache
cache and memory
I'm trying to understand more about this, like when the cache will write through and when it will write back. I know that write-through updates the cache line and the corresponding memory line immediately, while write-back delays the update until the line is replaced. Can they coexist? In write-through, does the data go directly to memory? And in write-back, does the data go through the cache hierarchy?
What caused my confusion is volatile in C/C++. As I understand it, variables of that type are stored directly in memory, which means they don't go through the cache. Am I right? So if I define a volatile variable and a normal variable like an int, how does the CPU distinguish whether to write directly to memory or through the cache hierarchy?
Is there any instruction that can control the cache? If not, how is the cache controlled? Some other hardware? The OS? A cache controller (if such a thing exists)?
If a cache miss happens, will the data be moved to the register directly from main memory, or will the data first be moved into the cache and then to the register? Is there a direct way to connect a register with main memory?
I think you're asking if a cache-miss load has to wait for L1 load-use latency after the cache line arrives from outer cache. i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See "which is optimal a bigger block cache size or a smaller one?" for some discussion of early-restart and critical-word-first.
Is there a direct way to connect a register with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes through the memory hierarchy, from the load/store execution unit through the same mechanism that L1D uses to talk to the memory controller.
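As an illustration of the cache-bypassing store path, here is a small sketch using the SSE2 intrinsic that maps to MOVNTI; the buffer and fill pattern are just examples.

#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si32 maps to MOVNTI */

/* Fill a buffer with non-temporal stores.  The data goes through the
   write-combining (line-fill) buffers instead of being allocated in
   the cache, which helps for large buffers you won't re-read soon. */
void fill_nt(int *buf, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&buf[i], value);

    /* NT stores are weakly ordered: fence before anyone else reads. */
    _mm_sfence();
}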
In the Linux kernel, why do many structures use the ____cacheline_aligned_in_smp macro? Does it help increase performance when accessing the structure? If yes then how?
____cacheline_aligned instructs the compiler to instantiate a struct or variable at an address corresponding to the beginning of an L1 cache line, for the specific architecture, i.e., so that it is L1 cache-line aligned. ____cacheline_aligned_in_smp is similar, but is actually L1 cache-line aligned only when the kernel is compiled in SMP configuration (i.e., with option CONFIG_SMP). These are defined in file include/linux/cache.h
These definitions are useful for variables (and data structures) that are not allocated dynamically, via some allocator, but are global, compiler-allocated variables (a similar effect can be accomplished by dynamic memory allocators that can allocate memory at specific alignment).
The reason for cache-line aligned variables is to manage the cache-to-cache transfers of these variables, by hardware cache coherence mechanisms, in SMP systems, so that their movement does not implicitly occur when other variables are moved. This is for performance critical code, where one expects contention in the access of variables by multiple cpus (cores). The usual problem one tries to avoid, in this case, is false sharing.
A variable's memory starting at the beginning of a cache line is half the work for this purpose; one also needs to "pack with it" only variables that should move together. An example is an array of variables, where each element of the array is to be accessed by only one cpu (core):
struct my_data {
        long int a;
        int b;
} ____cacheline_aligned_in_smp cpu_data[NR_CPUS];
This kind of definition will require from the compiler (in an SMP configuration of the kernel) that each cpu's struct will begin at a cache line boundary. The compiler will, implicitly, allocate extra space after each cpu's struct, so that the next cpu's struct will begin at a cache line boundary, also.
An alternative is to pad the data structure with a cache line's size of dummy, unused bytes:
struct my_data {
        long int a;
        int b;
        char dummy[L1_CACHE_BYTES];
} cpu_data[NR_CPUS];
In this case, only the dummy, unused data will be moved unintentionally; the data actually accessed by each cpu will only move between cache and memory, and vice versa, due to cache capacity misses.
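The same effect can be sketched outside the kernel with GCC/Clang's aligned attribute; the 64-byte line size and NR_CPUS value below are assumptions for illustration only.

#include <stdio.h>

#define L1_CACHE_BYTES 64   /* assumed line size               */
#define NR_CPUS        4    /* illustrative number of CPUs     */

/* User-space analogue of ____cacheline_aligned_in_smp: the aligned
   attribute pads sizeof(struct my_data) up to a multiple of 64, so
   every array element starts on its own cache line. */
struct my_data {
    long int a;
    int b;
} __attribute__((aligned(L1_CACHE_BYTES)));

static struct my_data cpu_data[NR_CPUS];

int main(void)
{
    printf("sizeof(struct my_data) = %zu bytes\n", sizeof(struct my_data));
    printf("element stride         = %zu bytes\n",
           (size_t)((char *)&cpu_data[1] - (char *)&cpu_data[0]));
    return 0;
}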
Each cache line in any cache (dcache or icache) is 64 bytes on the x86 architecture. Cache alignment is required to avoid false sharing of cache lines. If a cache line is shared between global variables (which happens more often in the kernel) and one of those variables is changed by a processor in its cache, that cache line is marked as dirty. In the other CPUs' caches the line becomes a stale entry, which needs to be flushed and re-fetched from memory. This can lead to cache line misses, which require more CPU cycles and reduce the performance of the system. Remember this is for global variables. Most kernel data structures use this to avoid cache line misses.
Linux manages the CPU cache in a very similar fashion to the TLB. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit locality of reference. To avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. Frequently, there are two levels, called the Level 1 and Level 2 CPU caches. The Level 2 CPU caches are larger but slower than the L1 cache, and Linux only concerns itself with the Level 1 or L1 cache.
CPU caches are organised into lines. Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32-byte address. With Linux, the size of the line is L1_CACHE_BYTES, which is defined by each architecture.
How addresses are mapped to cache lines varies between architectures, but the mappings come under three headings: direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach, where each block of memory maps to only one possible cache line. With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line, but only within a subset of the available lines.
Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try and maximise cache usage:
Frequently accessed structure fields are at the start of the structure to increase the chance that only one line is needed to address the common fields;
Unrelated items in a structure should try to be at least cache size bytes apart to avoid false sharing between CPUs;
Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high, as a reference to cache can typically be performed in less than 10ns, whereas a reference to main memory will typically cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible.
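As a rough worked example with the figures above (and an assumed 5% miss rate): if a cache reference costs 10ns and a main memory reference costs about 150ns, the average access time is roughly 10ns + 0.05 × 150ns ≈ 17.5ns, so even small changes in the miss rate show up directly in overall performance.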
Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. The hooks are placed in locations where the virtual to physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when the virtual address is being flushed from the cache.
I am trying to complete a simulator for a simplified MIPS computer using Java. I believe I have completed the pipeline logic needed for my assignment, but I am having a hard time understanding what the instruction and data caches are supposed to do.
The instruction cache should be direct-mapped with 4 blocks and the block size is 4 words.
So I am really confused about what the cache is doing. Is it going to memory and pulling the instruction from it? For example, will one block hold just the add command?
Would it make sense to implement it as a 2 dimensional array?
First you should know the basics of the cache. You can imagine the cache as an intermediate memory which sits between the DRAM (main memory) and your processor, but is very limited in size. Now, when you try to access a location in memory, you search for it first in the cache. If it is found (a cache hit), the processor takes the data and resumes execution. Generally a cache hit is supposed to take very few clock cycles, let's say 1 or 2. If the data is not found in the cache (a cache miss), then the data is fetched from main memory, filled into the cache and fed to the processor; the processor blocks until the data is fetched. This normally takes a few hundred clock cycles, depending on the DRAM you are using. The amount of data fetched from DRAM is equal to the cache line size. For that you should look up spatial locality of reference in caches.
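As a concrete starting point, here is a minimal sketch of the lookup logic for the instruction cache in the question (direct-mapped, 4 blocks of 4 words). It is written in C to match the other examples here, but the same layout maps directly onto a 2-dimensional array in Java; the memory size and word width are assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS      4      /* direct-mapped: 4 blocks          */
#define WORDS_PER_BLOCK 4      /* block size: 4 words = 16 bytes   */
#define MEM_WORDS       4096   /* assumed size of simulated memory */

/* One cache block: valid bit, tag, and 4 data words.  The data
   arrays across all blocks form the "2-dimensional array" idea
   from the question: icache[block].data[word]. */
struct block {
    bool     valid;
    uint32_t tag;
    uint32_t data[WORDS_PER_BLOCK];
};

static struct block icache[NUM_BLOCKS];
static uint32_t     main_memory[MEM_WORDS];   /* simulator's memory */

/* Fetch one instruction word given a byte address. */
uint32_t icache_read(uint32_t addr)
{
    uint32_t word_addr = addr >> 2;                    /* drop byte offset  */
    uint32_t offset    = word_addr % WORDS_PER_BLOCK;  /* word within block */
    uint32_t index     = (word_addr / WORDS_PER_BLOCK) % NUM_BLOCKS;
    uint32_t tag       = word_addr / (WORDS_PER_BLOCK * NUM_BLOCKS);

    struct block *b = &icache[index];
    if (!b->valid || b->tag != tag) {
        /* Miss: copy the whole block in from memory, then serve the word. */
        uint32_t base = word_addr - offset;            /* first word of block */
        memcpy(b->data, &main_memory[base], sizeof b->data);
        b->tag   = tag;
        b->valid = true;
    }
    return b->data[offset];
}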
I think this should get you a start.