DRAM cache miss - caching

I read a paragraph about DRAM (main memory) cache misses and SRAM (L1, L2, L3) cache misses, and I am not sure what it means.
Since DRAM is slower than SRAM, the cost of a cache miss is high, because DRAM cache misses are served from disk, while SRAM cache misses are usually served from DRAM-based main memory.
Here is my understanding :
If there is a cache miss in DRAM, it goes to disk (secondary storage) to find the datum, while if there is a cache miss in SRAM, it goes to DRAM to find the datum.
Could you tell me if I am right or wrong?

In general, if there's a miss at level L, you have to go one level further down, L+1.
A typical memory hierarchy comprises the following levels, from 0 onwards:
Processor registers
Processor caches (SRAM)
System memory (DRAM)
Mass storage (Flash/Spinning devices)
If you want to store something in a local register, you have to first fetch it from memory.
If your data is in one of the caches of the processor (SRAM), you don't need to go further down. If you have a cache miss however, you have to go to system memory (DRAM).
What happens here is that you might try to access a page which is not resident in DRAM, either because it has never been loaded or because at some point it was swapped out. You get a page fault and need to fetch the page from the storage device. The whole process stops as soon as you find your data.
Note that you want to avoid accessing slow storage drives as much as possible, so one thing you can do is create additional caching layers between DRAM and spinning disks using faster devices, e.g. SSDs (ZFS L2ARC, bcache, etc.).
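To see that cost difference from software, here is a minimal C sketch (my own illustration, not from the answer above) that sums the same array once sequentially and once with a 4 KiB stride; the array size and stride are assumptions chosen so the working set is far larger than a typical L3, so the strided pass misses the caches (and TLB) on nearly every access:

```c
/* Sketch: compare a cache-friendly sequential walk with a stride that
 * defeats the caches.  The array size and the 4096-byte stride are
 * arbitrary assumptions chosen so the working set is much larger than
 * a typical L3 cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64 Mi ints = 256 MiB, larger than L3 */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    long long sum = 0;
    double t0 = seconds();
    for (size_t i = 0; i < N; i++)               /* sequential: mostly cache hits */
        sum += a[i];
    double t1 = seconds();

    size_t stride = 4096 / sizeof *a;            /* one element per 4 KiB */
    for (size_t off = 0; off < stride; off++)
        for (size_t i = off; i < N; i += stride) /* strided: mostly cache misses */
            sum += a[i];
    double t2 = seconds();

    printf("sequential: %.3f s, strided: %.3f s (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}
```

On most machines the strided pass ends up several times slower even though it performs exactly the same number of additions: the difference is almost entirely time spent going further down the hierarchy.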

Related

Policy for writing to memory

I understand that when we need to deliver data to the CPU:
On a cache miss we access the cache, then access DRAM:
a) if it is a DRAM hit, we copy the data from DRAM back into the cache.
b) if it is a DRAM miss, we copy the data from the disk to DRAM and then from DRAM to the cache.
On a cache hit we just access the cache.
What is the policy that we should use when we write to memory?
For example:
On every write cache hit, do we update the cache, DRAM, and the disk?
For every write miss, do we write to the disk, read that disk block into DRAM,
and then read the DRAM block into the cache?
Most modern CPUs have cache so much faster than DRAM that write-back is the only policy that makes sense. Some older CPUs, or modern embedded ones, may have write-through CPU caches when the gap between on-chip cache and DRAM isn't so huge. Either way this is hardware-managed and invisible to software.
But writes always stop at DRAM, if/when they make it that far. The "backing store" on disk is not important when the page is in DRAM. If you want to think of DRAM as a cache for a memory-mapped file (or the pagefile for anonymous memory), the only write policy that makes sense for performance is write-back!
Write-back to disk is managed by software, so implementing a write-through policy would require making every store trap to the OS after committing to DRAM, at which point the OS would have to run a bunch of code to initiate a SATA write command of the whole page. (And would have to do this without accessing any DRAM itself, otherwise how would those writes get in sync on disk? Or maybe you'd let yourself off the hook here because kernel memory is generally not pageable, so this kernel code is only backed by DRAM, not ultimately by disk pages.)
Even if disk writes were efficiently possible at byte or word granularity (which they very much aren't, unless your "disk" is actually non-volatile RAM like 3D XPoint (e.g. Optane DC Persistent Memory) or battery-backed DRAM), just trapping on every store would still destroy performance, something like hundreds of times slower.
The gap between DRAM and disk has always been huge; hardware doesn't have mechanisms to make efficient write-through to "disk" possible, other than modern non-volatile storage connected to the memory bus so it can be truly memory-mapped, like Linux mmap(MAP_SYNC). But then there's no plain DRAM in between the CPU cache and the persistent NV-DRAM.
I/O vs. DRAM performance: a random DRAM write (on a modern x86, using cache-bypassing NT stores) takes something like ~60 ns at 64-byte granularity (for a burst write of a full cache line), including the time spent getting the store from a CPU core to a memory controller. (60 ns is actually something like the L3-miss load-use latency for reads, but I'm going to assume something similar for NT stores.)
A random write to a rotational magnetic disk takes about 10 ms, so that's roughly five orders of magnitude slower (10 ms / 60 ns ≈ 170,000×).
Also, disk writes have a minimum size of usually 512 or 4096 bytes (1 hardware sector), so writing one byte, one word, or one CPU cache line would take a read-modify-write cycle on the disk.
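To make the "write-back to disk is managed by software" point concrete, here is a minimal sketch using mmap and msync; the file name example.dat is a made-up placeholder and is assumed to already exist and be at least one page long:

```c
/* Sketch: DRAM acting as a software-managed write-back cache for a file.
 * The store below only dirties the mapped page in DRAM; msync() asks the
 * kernel to write the dirty page back to disk.  The file "example.dat" is
 * a placeholder and must already exist with >= 4096 bytes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* This store goes to cache, then DRAM; the on-disk copy is now stale. */
    memcpy(p, "hello", 5);

    /* Explicit write-back of the dirty page to the backing file.  Without
     * this, the kernel writes it back later on its own schedule. */
    if (msync(p, 4096, MS_SYNC) != 0) { perror("msync"); return 1; }

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

Until the msync (or until the kernel decides to clean the page), the on-disk copy is stale: DRAM is behaving exactly like a write-back cache for the file.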

What happens in CPU, cache and memory when CPU is instructed to store data to memory?

Let's suppose the memory hierarchy is one CPU with L1i, L1d, L2i, L2d, L3, and DRAM.
I'm wondering what happens at the lower levels of the machine when I use a MOV/store instruction (or any other instruction that causes the CPU to transfer data to memory). I know what happens when there is just a CPU and memory, but with caches in between I'm a bit confused. I've searched for this, but it only yielded information about data transfers between:
registers and memory
CPU and cache
cache and memory
I'm trying to understand more about this, like when the cache will write through and when it will write back. I just know that write-through means updating the cache line and the corresponding memory line immediately, and write-back means deferring the memory update until the line is replaced. Can they coexist? In write-through, is the data transferred directly to memory? And in write-back, does the data go through the cache hierarchy?
What caused my confusion is volatile in C/C++. As I understand it, such variables are stored directly in memory, which means they don't go through the cache. Am I right? So if I define a volatile variable and a normal variable like an int, how can the CPU distinguish between writing directly to memory and writing through the cache hierarchy?
Is there any instruction that can control the cache? If not, how is the cache controlled? By some other hardware? The OS? A cache controller (if such a thing exists)?
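For reference (this is my own illustration, not an answer from the thread): x86 does expose a few cache-affecting instructions to user space, reachable from C via intrinsics and compiler builtins. A minimal sketch, assuming GCC/Clang on x86 with SSE2:

```c
/* Sketch: a couple of cache-affecting operations available to user-space
 * code on x86.  __builtin_prefetch is a GCC/Clang builtin; _mm_clflush and
 * _mm_mfence come from the SSE2 intrinsics header. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdio.h>

int main(void) {
    int x = 123;

    __builtin_prefetch(&x);    /* hint: pull the line into cache (PREFETCH) */
    x = 456;                   /* normal store: goes through the cache      */
    _mm_clflush(&x);           /* CLFLUSH: evict the line, writing it back  */
    _mm_mfence();              /* order the flush relative to later accesses */

    printf("%d\n", x);
    return 0;
}
```

Note that none of this is what volatile does: volatile only constrains the compiler (it must actually emit the loads and stores, in order), and those accesses still go through the normal cache hierarchy.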

If a cache miss happens, is the data moved to the register directly, or first moved to the cache and then to the register?

If a cache miss happens, is the data moved to the register directly from main memory, or is it first moved to the cache and then to the register? Is there a direct way to connect a register with main memory?
I think you're asking whether a cache-miss load has to wait for the L1 load-use latency after the cache line arrives from an outer cache, i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See "which is optimal a bigger block cache size or a smaller one?" for some discussion of early-restart and critical-word-first.
Is there a direct way connect the register with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes through the memory hierarchy, from the load/store execution unit through the same mechanism that L1D uses to talk to the memory controller.
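For stores specifically, here is a hedged sketch of what those cache-bypassing MOVNT stores look like from C via the SSE2 intrinsics; the buffer size is an arbitrary assumption:

```c
/* Sketch: cache-bypassing (MOVNT) stores on x86 via SSE2 intrinsics.
 * Each _mm_stream_si128 goes into a write-combining (line-fill) buffer
 * rather than the L1d cache; full 64-byte lines get flushed to memory. */
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stdint.h>
#include <stdlib.h>

void fill_nt(void *dst, size_t bytes, uint8_t value) {
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;             /* must be 16-byte aligned      */
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&p[i], v);          /* NT store: bypasses the cache */
    _mm_sfence();                            /* order NT stores vs. later stores */
}

int main(void) {
    size_t bytes = 1 << 20;                  /* 1 MiB, arbitrary example size */
    void *buf = aligned_alloc(64, bytes);
    if (!buf) return 1;
    fill_nt(buf, bytes, 0);
    free(buf);
    return 0;
}
```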

Virtual Address to Physical address translation in the light of the cache memory

I do understand how a virtual address is translated to a physical address to access main memory. I also understand how the cache works.
But my problem is in putting the two concepts together and understanding the big picture of how a process accesses memory and what happens if we have a cache miss. So I have this drawing that will help me ask the following questions:
(Image omitted; assume a one-level cache.)
1- Does the process access the cache with the exact same physical address that represents the location of the byte in main memory?
2- Is the TLB actually in the first level of cache, or is it a separate memory inside the CPU chip dedicated to translation?
3- When there is a cache miss, I need to get a whole block and allocate it in the cache, but main memory is organized in frames (pages), not blocks. So is a process's page itself divided into cache blocks that can be brought into the cache on a miss?
4- Let's assume there is a TLB miss. Does that mean I need to go all the way to main memory and do the page walk there, or does the page walk happen in the cache?
5- Does a TLB miss guarantee that there will be a cache miss?
6- If you have any reading material that explains the big picture I am trying to understand, I would really appreciate you sharing it.
Thanks, and feel free to answer any single question I have asked.
Yes. The cache is not memory that can be addressed separately. Cache mapping translates a physical address into a location in the cache, but this mapping is not something a process usually controls. On some CPU architectures it is completely controlled by the hardware (e.g. Intel x86). On others the operating system would be expected to program the mapping.
The TLB in the diagram you gave is for virtual-to-physical address mapping. It is probably not for the cache. Again, on some architectures the TLBs are programmed by software, whereas on others they are controlled by the hardware.
Page size and cache line size do not have to be the same, as one relates to virtual memory and the other to physical memory. When a process accesses a virtual address, that address is translated to a physical address using the TLB, taking the page size into account. Once that's done, the size of a page is of no concern: the access is for a byte/word at a physical address. If this causes a cache miss, the block that is read is the cache block that covers the physical memory address being accessed.
A TLB miss requires a page translation, done by reading other memory (a page walk). This can occur in hardware on some CPUs (such as Intel x86/x64) or may need to be handled in software. Once the page translation has been completed, the TLB is reloaded with it.
A TLB miss does not imply a cache miss. A TLB miss just means the virtual-to-physical address mapping was not known and a page address translation had to occur. A cache miss means the physical memory content could not be provided quickly.
To recap:
the TLB is there to convert virtual addresses to physical addresses quickly. It exists to cache the virtual-to-physical memory mapping. It does not have anything to do with physical memory content.
the cache is there to allow faster access to memory. It is only there to provide the content of physical memory faster.
Keep in mind that the term "cache" can be used for many purposes (note, for example, the use of "cache" when describing the TLB). "TLB" is a bit more specific and usually implies virtual memory translation, though that's not universal. For example, some DMA controllers have a TLB too, but that TLB is not necessarily used to translate virtual to physical addresses; rather, it converts block addresses to physical addresses.
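To tie the recap together, here is a small sketch showing which address bits the TLB cares about versus which bits the cache uses. The parameters (4 KiB pages, a 32 KiB 8-way L1 with 64-byte lines) and the addresses are assumed/made up for illustration, not taken from the answer:

```c
/* Sketch: how an address splits for the TLB versus for the cache.
 * Assumed parameters (typical, not from the answer above):
 *   4 KiB pages; 32 KiB 8-way L1d with 64-byte lines -> 64 sets. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12            /* 4 KiB pages            */
#define LINE_SHIFT  6            /* 64-byte cache lines    */
#define NUM_SETS   64            /* 32 KiB / 64 B / 8 ways */

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;   /* made-up virtual address */

    /* TLB side: translate the virtual page number, keep the page offset. */
    uint64_t vpn         = vaddr >> PAGE_SHIFT;
    uint64_t page_offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);

    /* Pretend the TLB mapped this VPN to physical frame 0x5a5a5. */
    uint64_t pfn   = 0x5a5a5;
    uint64_t paddr = (pfn << PAGE_SHIFT) | page_offset;

    /* Cache side: the physical address picks a set and provides a tag. */
    uint64_t line_offset = paddr & ((1ULL << LINE_SHIFT) - 1);
    uint64_t set_index   = (paddr >> LINE_SHIFT) % NUM_SETS;
    uint64_t tag         = paddr >> (LINE_SHIFT + 6);  /* 6 = log2(NUM_SETS) */

    printf("vpn=%#llx page_offset=%#llx\n",
           (unsigned long long)vpn, (unsigned long long)page_offset);
    printf("paddr=%#llx set=%llu tag=%#llx line_offset=%#llx\n",
           (unsigned long long)paddr, (unsigned long long)set_index,
           (unsigned long long)tag, (unsigned long long)line_offset);
    return 0;
}
```

The TLB only ever deals with the vpn-to-pfn part; the cache only ever deals with the tag/set/offset part of the resulting physical address, which is why a TLB miss and a cache miss are independent events.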

Is Translation Lookaside Buffer (TLB) the same level as L1 cache to CPU? So, Can I overlap virtual address translation with the L1 cache access?

I am trying to understand the whole structure and the concepts behind caching. Since we use the TLB to quickly map virtual addresses to physical addresses, if we use a virtually-indexed, physically-tagged (VIPT) L1 cache, can one overlap the virtual address translation with the L1 cache access?
Yes, that's the whole point of a VIPT cache.
Since the virtual and physical addresses match over the lower bits (the page offset is the same), those bits don't need translation. Most VIPT caches are built around this (note that this limits the number of sets you can use, but you can grow the associativity instead), so you can use the lower bits to do a lookup in the cache even before you have found the translation in the TLB.
This is critical because the TLB lookup itself takes time, and L1 caches are usually designed to provide as much bandwidth and as low a latency as possible to avoid stalling the often much faster execution.
If you miss the TLB and suffer an even greater latency (either a second-level TLB or, god forbid, a page walk), it's less critical, since you can't really do anything with the cache lookup until you have a tag to compare against. But the TLB hit + cache hit case should be the common case in many applications, so the few cycles saved there are usually considered worth optimizing and aligning the pipelines for.
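As a quick worked example of that set-count limit (with assumed, typical parameters, not numbers from the answer): for the index to come entirely from the untranslated page-offset bits, sets × line size must not exceed the page size, which is why a 32 KiB VIPT L1 with 64-byte lines ends up 8-way associative:

```c
/* Sketch: the VIPT constraint, with assumed (typical) numbers.
 * For the index to come entirely from the page offset:
 *   sets * line_size <= page_size
 * e.g. 4 KiB page / 64 B line = 64 sets max, so a 32 KiB L1d
 * must be at least 32 KiB / (64 sets * 64 B) = 8-way associative. */
#include <stdio.h>

int main(void) {
    unsigned page_size = 4096, line_size = 64, cache_size = 32 * 1024;
    unsigned max_sets  = page_size / line_size;               /* 64 */
    unsigned min_ways  = cache_size / (max_sets * line_size); /* 8  */
    printf("max sets: %u, minimum associativity: %u-way\n", max_sets, min_ways);
    return 0;
}
```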
