Policy for writing to memory - caching

I understand that when we need to deliver data to the CPU:
On a cache miss we access the cache, then the DRAM:
a) if it is a DRAM hit, we copy the data from DRAM back into the cache.
b) if it is a DRAM miss, we copy the data from disk to DRAM and then from DRAM to the cache.
On a cache hit we just access the cache.
What is the policy that we should use when we write to memory?
For example:
On every write hit, do we update the cache, the DRAM, and the disk?
On every write miss, do we write to the disk, read that disk block into DRAM,
and then read the DRAM block into the cache?

Most modern CPUs have caches so much faster than DRAM that write-back is the only policy that makes sense. Some older CPUs, or modern embedded CPUs, may have write-through caches where the gap between on-chip cache and DRAM isn't so huge. Either way this is hardware-managed and invisible to software.
But writes always stop at DRAM, if/when they make it that far. The "backing store" on disk is not important when the page is in DRAM. If you want to think about DRAM as cache for a memory-mapped file (or the pagefile for anonymous memory), the only write policy that makes sense for performance is write-back!
Write-back to disk is managed by software, so implementing a write-through policy would require making every store trap to the OS after committing to DRAM, at which point the OS would have to run a bunch of code to initiate a SATA write command of the whole page. (And would have to do this without accessing any DRAM itself, otherwise how would those writes get in sync on disk? Or maybe you'd let yourself off the hook here because kernel memory is generally not pageable, so this kernel code is only backed by DRAM, not ultimately by disk pages.)
Even if disk-write was efficiently possible with byte or word granularity (which it very much isn't unless your "disk" is actually non-volatile RAM like 3D XPoint (e.g. Optane DC Persistent Memory), or battery-backed DRAM), just trapping every store would still destroy performance, like hundreds of times slower.
The gap between DRAM and disk has always been huge; hardware doesn't have mechanisms to make efficient write-through to "disk" possible. The exception is modern non-volatile storage connected to the memory bus so it can be truly memory-mapped, like Linux mmap(MAP_SYNC). But then there's no plain DRAM in between the CPU cache and the persistent NV-DRAM.
I/O vs. DRAM performance: a random DRAM write (on a modern x86, using cache-bypassing NT stores) takes something like ~60 ns with 64-byte granularity (for a burst write of a full cache line), including the time spent getting the store from a CPU core to a memory controller. (60 ns is actually more like the L3-miss load-use latency for reads, but I'm going to assume something similar for NT stores.)
A random write to a rotational magnetic disk takes about 10 ms, so that's roughly 170,000 times slower, call it five to six orders of magnitude.
Also, disk writes have a minimum size, usually 512 or 4096 bytes (one hardware sector), so writing one byte, one word, or one CPU cache line would require a read-modify-write cycle on the disk.
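To make that sector-granularity point concrete, here is a minimal sketch (assuming Linux with O_DIRECT; the file path, function name, and 4096-byte sector size are placeholders) of what software would have to do just to push one byte "through" to disk: read the whole sector, patch the byte, write the whole sector back, and wait for it.

    #define _GNU_SOURCE           /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SECTOR 4096           /* assumed hardware sector size */

    /* Read-modify-write one byte at offset `off`, at sector granularity. */
    int write_one_byte(const char *path, off_t off, uint8_t val)
    {
        int fd = open(path, O_RDWR | O_DIRECT);
        if (fd < 0) { perror("open"); return -1; }

        void *buf;
        if (posix_memalign(&buf, SECTOR, SECTOR)) { close(fd); return -1; }

        off_t block = off & ~(off_t)(SECTOR - 1);      /* sector containing the byte */

        if (pread(fd, buf, SECTOR, block) != SECTOR)   /* read the whole sector */
            goto fail;
        ((uint8_t *)buf)[off - block] = val;           /* patch one byte of it */
        if (pwrite(fd, buf, SECTOR, block) != SECTOR)  /* write the whole sector back */
            goto fail;
        fsync(fd);                                     /* wait for the medium */

        free(buf); close(fd); return 0;
    fail:
        free(buf); close(fd); return -1;
    }

Everything after the in-memory patch is paying the multi-millisecond disk cost described above, which is why nobody builds write-through all the way to disk.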

Related

Could multiple CPUs access memory simultaneously in a common home computer?

As far as I know, in a modern multi-core CPU system, the different CPUs share one memory bus. Does that mean that only one CPU can access memory at any given moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory busses, it's normal for them to be half-duplex (i.e. either loading or storing, not sending data in both directions at once like gigabit ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
First of all, remember that when a CPU core "accesses the memory", it doesn't have to read directly from DRAM. The cache maintains a coherent view of memory shared by all cores, using (a variant of) the MESI cache-coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
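A quick way to feel that cost from software (a sketch, assuming Linux with pthreads; the counter layout and iteration count are made up for illustration) is to let two threads hammer counters that share one cache line, then pad them onto separate lines and compare run times:

    /* Sketch: cache-line ping-pong between two cores (assumed Linux + pthreads).
       Build with: cc -O2 pingpong.c -pthread */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 100000000ULL

    struct counters {
        volatile uint64_t a;
        /* char pad[56]; */        /* uncomment: give each counter its own 64-byte line */
        volatile uint64_t b;
    };

    static _Alignas(64) struct counters c;

    static void *bump_a(void *arg) { for (uint64_t i = 0; i < ITERS; i++) c.a++; return NULL; }
    static void *bump_b(void *arg) { for (uint64_t i = 0; i < ITERS; i++) c.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%llu b=%llu\n", (unsigned long long)c.a, (unsigned long long)c.b);
        return 0;
    }

With the padding commented out, the line bounces between the two cores' caches in Modified state on nearly every increment; with the padding in, each core keeps its own line and runs at full speed.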
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one on each channel.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
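For a rough sense of what the second channel buys you, here is a tiny worked example in C (the DDR4-3200 figures are illustrative, not a measurement of any particular machine):

    /* Sketch: theoretical peak DRAM bandwidth for an example dual-channel setup. */
    #include <stdio.h>

    int main(void)
    {
        double transfers_per_s = 3200e6;   /* DDR4-3200: 3200 mega-transfers/s  */
        double bytes_per_xfer  = 8;        /* 64-bit channel = 8 bytes/transfer */
        double channels        = 2;        /* dual-channel, both populated      */

        double per_channel = transfers_per_s * bytes_per_xfer;   /* 25.6 GB/s */
        double total       = per_channel * channels;             /* 51.2 GB/s */
        printf("peak: %.1f GB/s per channel, %.1f GB/s total\n",
               per_channel / 1e9, total / 1e9);
        return 0;
    }

Real sustained bandwidth is lower because of refresh, bank conflicts, and read/write turnarounds, but the factor of two from the second channel is real when both channels are populated and accesses are spread across them.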
I know that modern CPUs use caches to achieve low latency. So my question is about the situation where the computer has just started: there is no data in the cache yet, so the CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things they can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see whether a single-threaded process can really achieve it and, if not, how the effective bandwidth use scales with the number of processors. (A crude measurement sketch follows at the end of this answer.)
Now I'll explain further.
It all depends on the architecture you're using; for now, let's say modern SMP and SDRAM:
1) If two cores try to access the same address in RAM, it could go several ways:
If they both want to read, simultaneously:
Two cores on the same chip will probably share an intermediate cache at some level (2 or 3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready.
Two cores on different chips may not share a cache, but still need to coordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.
If they both want to write:
Two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM.
Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip 1, fetched into chip 2's cache, modified, and then written back again (no idea whether the write/fetch can be coalesced by snooping).
2) If two cores try to access different addresses:
For a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.
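To "establish the memory bandwidth of your machine" as suggested in the summary above, a crude single-threaded probe like this sketch is enough for a first approximation (buffer size and repeat count are arbitrary; a serious measurement would use something like the STREAM benchmark, and you'd compile with -O2):

    /* Crude single-threaded memory-bandwidth probe (a sketch, not STREAM).
       Streams over a buffer much larger than the last-level cache so that
       most accesses miss in cache and hit DRAM. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_BYTES (1UL << 30)   /* 1 GiB: assumed larger than L3 */
    #define REPEATS   10

    int main(void)
    {
        unsigned char *buf = malloc(BUF_BYTES);
        if (!buf) return 1;
        memset(buf, 1, BUF_BYTES);              /* fault the pages in first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long long sum = 0;
        for (int r = 0; r < REPEATS; r++)
            for (size_t i = 0; i < BUF_BYTES; i += 8)
                sum += *(unsigned long long *)(buf + i);   /* 8-byte strided reads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gb  = (double)BUF_BYTES * REPEATS / 1e9;
        printf("read %.1f GB in %.2f s -> %.1f GB/s (sum=%llu)\n",
               gb, sec, gb / sec, sum);
        return 0;
    }

Run one copy, then several copies pinned to different cores, and compare the per-copy numbers: on many desktop chips one core gets most of the way to the DRAM limit, while on big many-core Xeons it doesn't come close.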

If a cache miss happens, is the data moved to the register directly, or first moved to the cache and then to the register?

If a cache miss happens, is the data moved to the register directly from main memory, or is the data first moved to the cache and then to the register? Is there a direct way to connect the register with main memory?
I think you're asking if a cache-miss load has to wait for L1 load-use latency after the cache line arrives from outer cache. i.e. wait for the line to be written to L1, then retry the load normally.
I'm almost certain that high-performance CPUs don't work that way. L2-hit latency is important for many workloads, and you need a load buffer tracking that incoming cache line anyway to know when to restart the load. So you just grab the data as it comes in, in parallel with writing it to the cache. The TLB check was already done as part of generating a physical address to send to the outer cache.
Most real CPUs use an early-restart design that lets the pipeline restart as soon as the word / byte they were waiting for arrives, so the rest of the cache line transfers "in the background".
A further optimization is critical-word-first, which asks for the cache line to be sent starting with the needed word, so a demand miss for a word in the middle of a cache line can receive that word first. I think modern DDR DRAM still supports this when reading from main memory, starting the 64-byte burst at a specified 64-bit chunk. I'm not 100% sure modern out-of-order CPUs use this, though; when out-of-order execution allows multiple outstanding misses for the same line, it probably makes it more complicated.
See which is optimal a bigger block cache size or a smaller one? for some discussion of early-restart and critical-word-first.
Is there a direct way to connect the register with main memory?
It depends what you mean by "direct". In a modern high-performance CPU, there will be 2 or 3 layers of cache and a memory controller with its own buffering to arbitrate access to memory for multiple cores. So no, you can't.
If you design a simple single-core CPU with special cache-bypassing load and store instructions, then sure. Or if you consider early-restart as "direct", then yes it already happens.
For stores, x86 and some other architectures have cache-bypassing stores, but x86's MOVNT instructions don't directly connect registers with memory. Stores go into a line-fill buffer which is flushed when full, so you get write-combining.
There are also uncacheable memory regions: a load or store to uncacheable memory is architecturally "direct", but in the actual microarchitecture it still goes through the memory hierarchy, from the load/store execution units through the same mechanism that L1d uses to talk to the memory controller.
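For concreteness, here is a hedged sketch of the MOVNT path mentioned above, using the SSE2 intrinsics (it assumes an x86-64 target and a 16-byte-aligned destination; the function name is made up):

    /* Sketch: filling a buffer with NT (non-temporal, cache-bypassing) stores
       on x86 using SSE2 intrinsics. Assumes dst is 16-byte aligned and len is
       a multiple of 16. */
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi8, _mm_sfence */
    #include <stddef.h>

    void fill_nt(void *dst, unsigned char val, size_t len)
    {
        __m128i v = _mm_set1_epi8((char)val);
        char *p = (char *)dst;
        for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), v);   /* bypasses the cache hierarchy */
        _mm_sfence();   /* order the NT stores before any later flag store */
    }

Each _mm_stream_si128 goes into a write-combining buffer rather than allocating a line in L1d, and the sfence is needed before another thread can safely rely on seeing the data.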

DRAM cache miss

I read a paragraph about DRAM (main memory) cache misses and SRAM (L1, L2, L3) cache misses and I am not sure what it means.
Since DRAM is slower than SRAM, the cost of cache misses is expensive because DRAM cache misses are served from disk, while SRAM cache misses are usually served from DRAM-based main memory.
Here is my understanding :
If there is a cache miss in DRAM, it goes to disk (secondary storage) to find the datum, while if there is a cache miss in SRAM, it goes to DRAM (main memory) to find the datum.
Could you tell me if I am right or wrong?
In general, if there's a miss at level L, you have to go one level further down, L+1.
A typical memory hierarchy comprises the following levels, from 0 onwards:
Processor registers
Processor caches (SRAM)
System memory (DRAM)
Mass storage (Flash/Spinning devices)
If you want to store something in a local register, you have to first fetch it from memory.
If your data is in one of the caches of the processor (SRAM), you don't need to go further down. If you have a cache miss however, you have to go to system memory (DRAM).
What happens here is that you might try to access a memory page which is not in memory, either because it has never been loaded or because at some point it has been swapped out. You have a page fault and you need to fetch your page from storage devices. This process stops as soon as you find your data.
Note that you want to avoid access to slow storage drives as much as possible, so what you can do is create additional caching layers between DRAM and spinning disks by means of faster devices, e.g. SSDs (ZFS L2ARC, bcache etc.).
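A minimal sketch of that path from user space, assuming Linux and a placeholder file name: mmap() sets up the mapping without touching the disk, the first access to each page takes a page fault that the kernel services by reading from storage into DRAM (the page cache), and msync() pushes dirty pages back down.

    /* Sketch: demand paging of a memory-mapped file (assumed Linux).
       The file name "data.bin" is just a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* No data is read here; the kernel only sets up the mapping lazily. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        char c = p[0];      /* first touch: page fault, kernel reads the page from disk */
        p[0] = c + 1;       /* dirty the page; it now lives in DRAM (page cache) */

        msync(p, st.st_size, MS_SYNC);   /* write dirty pages back to the file */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }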

Difference between ZRAM and ZSWAP

Does anyone know what the difference is between the ZRAM and ZSWAP features in the Linux kernel? They seem very similar: both store compressed pages in RAM.
zram
Status: Available in mainline kernel as of version 3.14 (March 2014)
Implementation: compressed block device, memory is dynamically
allocated as data is stored
Usage: Configure a zram block device as a swap device to eliminate the need
for a physical swap device or swap file
Benefits:
Eliminates the need for a physical swap device. This became popular when
netbooks first showed up. Zram (then compcache) allowed users to avoid
having swap shorten the lifespan of the SSDs in these memory-constrained
systems.
A zram block device can be used for applications other than swap;
conceivably anything you might use a block device for.
Drawbacks:
Once a page is stored in zram it will remain there until it is paged in
or invalidated. The first pages to be paged out will be the oldest pages
(the LRU list); these are 'cold' pages that are infrequently accessed.
As the system continues to swap, it will move on to pages that are
warmer (more frequently accessed), and these may not be able to be
stored because of the swap slots consumed by the cold pages. What zram
cannot do (compcache had the option to configure a backing block device)
is evict pages out to physical disk. Ideally you want to age data out of
the in-kernel compressed swap space to disk, so that you can use kernel
memory for caching warm swap pages or free it for more productive use.
zswap
Status: Available in mainline kernel as of version 3.11 (September 2013)
Implementation: compressed in-kernel cache for swap pages. The in-kernel
cache is compressed, the compression algorithm is pluggable via the
CryptoAPI, and the storage for pages is dynamically allocated. Older
pages can be evicted to disk, making this a sort of write-behind
cache.
Usage: Cache swap pages destined for regular swap devices (or swap
files).
Benefits:
Integration with the swap code (using the Frontswap API) allows zswap
to choose to store only pages that compress well and to handle memory
allocation failures; in those cases pages are sent to the backing swap
device.
Oldest pages in the cache are pushed out to backing swap device to
make room for newer pages, this solves the LRU inversion problem
that a lack of page eviction would present.
Drawbacks:
Needs a physical swap device (or swapfile).
ZRAM is a module of the Linux kernel, previously called "compcache". ZRAM increases performance by avoiding paging on disk and instead uses a compressed block device in RAM in which paging takes place until it is necessary to use the swap space on the hard disk drive. Since using RAM is faster than using disks, zram allows Linux to make more use of RAM when swapping/paging is required, especially on older computers with less RAM installed.
ZSWAP is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.
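Whether zswap is actually in use on a given machine can be checked from its module parameters in sysfs. A small sketch, assuming a Linux kernel with zswap built in (the paths may simply be absent otherwise):

    /* Sketch: read zswap's runtime parameters from sysfs (assumed Linux
       with zswap compiled in; paths may differ or be absent otherwise). */
    #include <stdio.h>

    static void show(const char *name)
    {
        char path[256], buf[128];
        snprintf(path, sizeof path, "/sys/module/zswap/parameters/%s", name);
        FILE *f = fopen(path, "r");
        if (!f) { printf("%s: <not available>\n", name); return; }
        if (fgets(buf, sizeof buf, f))
            printf("%s: %s", name, buf);   /* buf already ends with '\n' */
        fclose(f);
    }

    int main(void)
    {
        show("enabled");           /* Y/N: is zswap intercepting swap-outs?   */
        show("compressor");        /* e.g. lzo, lz4, zstd                     */
        show("max_pool_percent");  /* cap on RAM used for the compressed pool */
        return 0;
    }

The same parameters can be set at boot (zswap.enabled=1) or at run time by writing to those files as root.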

How are cache memories shared in multicore Intel CPUs?

I have a few questions regarding cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, it has many repercussions when one writes software for multicore processors/multiprocessor systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo etc.) does each CPU core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, I believe there might be fewer cache misses: if a particular processor's cache does not have some data, a second processor's cache might have it, avoiding a read from memory into the first processor's cache. Is this assumption valid and true?
Will there be any problems in allowing any processor to access other processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core,
Core two Duo etc..) does each cpu core/processor have its own cache
memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a unified L2, shared between all cores. Or on 65nm Core2Quad (which was two core2duo dies in one package), each pair of cores had their own last-level cache and couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
Last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically 1.5 to 2.25 MB of L3 cache per core, so a many-core Xeon might have a 36 MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to the shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)
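On Linux with glibc, the per-level sizes described above can be queried at run time with sysconf(); the _SC_LEVEL* names are a glibc extension and may return 0 where the C library can't determine them:

    /* Sketch: query cache sizes at run time (glibc extension; values may be 0
       if the C library can't determine them on a given machine). */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("L1d: %ld KiB (line %ld B)\n",
               sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024,
               sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2:  %ld KiB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
        printf("L3:  %ld KiB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
        return 0;
    }

The same information is exposed by getconf and under /sys/devices/system/cpu/cpu0/cache/.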
Can one processor/core access each other's cache memory, because if
they are allowed to access each other's cache, then I believe there
might be lesser cache misses, in the scenario that if that particular
processors cache does not have some data but some other second
processors' cache might have it thus avoiding a read from memory into
cache of first processor? Is this assumption valid and true?
No. Each CPU core's L1 caches are tightly integrated into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be fast enough for very hot data, but can't be big enough for less-frequently used data that's still accessed regularly. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access other
processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
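To tie that back to software: coherence keeps the caches consistent, and a release/acquire handoff tells the hardware and compiler in what order the second core must see things. A minimal C11 sketch (the thread structure and variable names are illustrative):

    /* Sketch: one core publishes data, another consumes it. MESI(F) keeps the
       caches coherent; the release/acquire pair orders the data write before
       the flag so the reader never sees the flag without the data. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;              /* plain data written by the producer */
    static atomic_int ready = 0;     /* flag used for the handoff */

    static void *producer(void *arg)
    {
        payload = 42;                                         /* write the data */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                 /* spin until published */
        printf("consumer sees payload = %d\n", payload);      /* guaranteed to be 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The coherency machinery moves the lines holding ready and payload from the producer's cache to the consumer's; the memory-order annotations guarantee that if the consumer sees ready == 1, it also sees payload == 42.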
Quick answers
1) Yes. 2) No, but it may all depend on which memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.
For a full-length explanation of the issue, you should read the 9-part article "What every programmer should know about memory" by Ulrich Drepper (http://lwn.net/Articles/250967/); it will give you the full picture of the issues you seem to be inquiring about, in good and accessible detail.
To answer your first question: I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache and they share a second-level cache. This helps with both data synchronization and utilization of memory.
To answer your second question, I believe your assumption is correct. If the processors were able to access each other's caches, there would obviously be fewer cache misses, as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having a shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question, there could potentially be a problem with accessing another processor's cache memory, which goes to the "Single Write Multiple Read" principle: we can't allow more than one process to write to the same location in memory at the same time.
For more info on the Core 2 Duo, read this neat article:
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/
