Why are separate icache and dcache needed? [duplicate]

Can someone please explain what we gain by having a separate instruction cache and data cache?
Any pointers to a good link explaining this would also be appreciated.

The main reason is performance; a secondary reason is power consumption.
Separate dCache and iCache make it possible to fetch instructions and data in parallel.
Instructions and data have different access patterns.
Writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. For example, the AMD Software Optimization Guide for 10h and 12h Processors states that:
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and stored alongside the instruction cache.
The Intel Nehalem CPU features a loopback buffer, and in addition the Sandy Bridge CPU features a µop cache (see The microarchitecture of Intel, AMD and VIA CPUs). Note that these features relate to code and have no direct counterpart for data. They benefit performance, and since Intel "prohibits" CPU designers from introducing features which result in an excessive increase of power consumption, they presumably also benefit total power consumption.
Most CPUs feature a data forwarding network (store-to-load forwarding). There is no store-to-load forwarding for code, simply because code is modified much less frequently than data.
Code exhibits different patterns than data.
That said, most CPUs nowadays have a unified L2 cache which holds both code and data. The reason for this is that having separate L2I and L2D caches would pointlessly consume the transistor budget while failing to deliver any measurable performance gains.
(Surely, the reason for having separate iCache and dCache isn't reduced complexity, because if the reason were reduced complexity then there wouldn't be any pipelining in any of the current CPU designs. A CPU with pipelining is more complex than a CPU without pipelining. We want the increased complexity. The fact is: the next CPU design is (usually) more complex than the previous design.)

It has to do with which functional units of the CPU primarily access that cache. The ALU and FPU access the data cache while the decoder and scheduler access the instruction cache, and pipelining often lets the instruction fetch/decode machinery and the execution units work simultaneously, so using a single cache would cause contention between these two components. By separating them we lose some flexibility but gain the ability for these two major parts of the processor to fetch from cache simultaneously.

One reason is reduced complexity: you could implement a single shared cache that retrieves multiple lines at once, or services requests asynchronously (see Hit-Under-Miss), but doing so makes the cache controller far more complicated.
Another reason is execution stability - if you have a known amount of icache and dcache, caching of data cannot starve the cache system of instructions, which may occur in a simplistic shared cache.
And as Dan stated, having them separated makes pipelining easier, without adding to the controller complexity.

As the processor's MEM and FETCH stages can access the L1 cache (assume it is combined) simultaneously, there can be a conflict over which one to give priority, and this can become a performance bottleneck. One way to resolve it is to build the L1 cache with two read ports, but increasing the number of ports increases the cache area roughly quadratically and hence increases power consumption.
Also, if the L1 cache is combined, data blocks might evict blocks containing instructions that are important and about to be accessed. These evictions and the cache misses that follow can hurt overall performance.
Also, most of the time the processor fetches instructions sequentially (with a few exceptions like taken branches and jumps), which gives the instruction cache good spatial locality and hence a good hit rate. And, as mentioned in other answers, there are hardly any writes to the iCache (apart from self-modifying code such as that produced by JIT compilers). So separate iCache and dCache designs can each be optimized for their own access patterns and supporting structures such as load/store queues and write buffers.

There are generally two kinds of architectures: 1. the von Neumann architecture and 2. the Harvard architecture. The Harvard architecture uses two separate memories. You can read more on this ARM page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3839.html

Related

Sequential program cache consistency

I wanted to ask you a question regarding the consistency of the cache memory.
If I have a sequential program, I shouldn't have cache consistency problems, because the instructions are executed sequentially and consequently there is no danger that several processors will write the same memory location at the same time, even if there is shared memory.
The case is different when I have a parallel program: it runs on multiple processors and there is a high probability of cache consistency problems.
Is that right?
In a single-threaded program, the thread doesn't migrate between cores by itself; only the OS moves it, and when it does, the thread's state is re-loaded from memory into the new core's cache, so there is no coherence problem there.
In a multi-threaded program, an update to a variable that is also present in other caches needs to be communicated to those caches somehow. This causes the data to re-flow through all the other caches. It may not directly block the updating thread, but as soon as another thread wants only up-to-date values, the synchronization / locking will see a performance hit, especially when other variables at very close addresses are also being updated, so that they end up in the same cache line. That's why an array of locks with 20-byte elements performs worse than one with 128-byte elements: the smaller elements share cache lines.
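To make the lock-array point concrete, here is a minimal C++ sketch (the PaddedLock name, the 64-byte line size, and the table size of 256 are illustrative assumptions, not taken from the answer above): each lock is padded out to a full cache line, so releasing one lock no longer invalidates the line that another core is spinning on for a neighbouring lock.

```cpp
#include <atomic>

// One spinlock per slot, padded to a full cache line so that two adjacent
// locks never share a line. Releasing lock i then no longer invalidates the
// line that a core spinning on lock i+1 has cached (false sharing).
struct alignas(64) PaddedLock {              // 64 bytes: a typical x86 line size
    std::atomic<bool> locked{false};

    void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.store(false, std::memory_order_release); }
};

static_assert(sizeof(PaddedLock) == 64, "each lock occupies its own cache line");

// A striped lock table, e.g. one lock per group of hash-table buckets.
PaddedLock lock_table[256];
```

Without the alignas(64), several locks would share one cache line, and every unlock would bounce that line between all the cores spinning on neighbouring locks.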
If CPUs did not have coherence, multi-threading wouldn't work efficiently. So, in some designs, they chose to broadcast every update to all caches (snooping). But this is not efficient with a high number of cores: if 1000 cores existed on the same CPU, it would require 1000-way broadcast logic consuming a lot of circuit area. So designers break the problem into smaller parts and add other mechanisms, such as directory-based coherence and clusters of cores, but this adds more latency to the coherence.
On the other hand, many GPUs do not implement automatic cache coherence because
the algorithm given by the developer is generally embarrassingly parallel, with only a few synchronization points, and blocks of threads do not need to communicate with other blocks (when they do, they go through a common cache by the developer's explicit choice of instructions anyway)
there are thousands of streaming pipelines (not real cores) that just need to make memory requests efficiently, or else there wouldn't be enough die area for that many pipelines
high throughput is required rather than low latency (so there is no need for implicit coherence anywhere)
so the multi-processors in a GPU are designed to do work completely independently of each other, and adding automatic coherence would add little performance (if it didn't subtract some). When the developer needs to synchronize data between multiple threads in the same block on a GPU, there are explicit instructions for this, and without them no data update is guaranteed to be visible. So cache coherence in a GPU is effectively optional.

Memory models / Cache coherence protocols: How TSO goes together with MESIF

Having just worked through my systems programming lecture material, I stumbled upon the crucial concepts of memory models as well as cache coherence protocols. Although they make sense as independent concepts, it is not really clear how they go together. Specifically, when looking at x86, I am working with an ISA enforcing the TSO memory model, and a CPU (in the case of Intel) using the MESIF cache coherence protocol.
In the beginning, the professor introduced cache coherence protocols as a means of ensuring that, to any core in the chip, it appears as if they all access one large, monolithic block of memory. Then, after wrapping up with cache coherence, he continued with memory models, specifically TSO (we were introduced to linearizability / sequential consistency in our parallel programming class already). The following is a direct quote from the lecture material about the x86 memory model:
Standard for 64-bit x86 processors
Sometimes called Total Store Ordering (TSO)
Earlier 32-bit x86 implemented PRAM – weaker!
Write-to-read relaxation: later reads can bypass earlier writes
All processors see writes from one processor in the order they were issued.
Processors can see different interleavings of writes from different processors.
It seems as if we "solved" the problem of slow sequential consistency by introducing what is (yet another) layer in the cache hierarchy, namely the (ordered) store buffer.
To me, TSO seems orthogonal to the principles of cache coherence. We worked so hard to get our caches to match, only to add another layer in between not covered by cache coherence.
Questions:
Why are the store buffers not covered by cache coherence protocols? Is it assumed that writeback to L1 from the store buffers is so fast that inconsistencies due to intermediate writebacks would not be an issue in the majority of cases? (i.e. I reckon a store buffer -> L1 transfer only takes a few cycles, so as soon as L1 receives the data, it sends a transaction across the bus, telling other cores to invalidate their copy)
How should I think of the two concepts of cache coherence as well as memory model? The way I understand it, memory models are the theoretical concepts of what we want, and cache coherence is part of the practical implementation to achieve said model.
Thank you so much in advance for the clarifications!
Best,
Felix
The sequential consistency model is the most commonly prescribed memory model for shared-memory parallel programming. For a parallel program comprising multiple tasks or threads, sequential consistency requires the two conditions described below.
Program order execution: All memory operations in each task appear to execute in that task's program order.
Memory access atomicity: Memory operations (in all tasks of the parallel program) appear to execute one-at-a-time.
Every programmer assumes these conditions to reason about their parallel programs.
Unfortunately, sequential consistency is a less useful model than imagined. The main reason is the implementation cost of these two properties: enforcing them prohibits many basic compiler and hardware optimizations [1]. Other weak/relaxed memory models have been proposed that relax these properties and allow such optimizations. These weak memory models trade programmability for performance.
Why are the store buffers not covered by cache coherence protocols?
It is a design choice in TSO, made for performance reasons. Serving a load from the store buffer, or serving a load while its preceding store (to a different address) is still in the store buffer, hides the store latency. If store buffers had to be kept coherent, a load would have to wait until all other processors had acknowledged receipt of the invalidates generated by the store. Moreover, most of the time there may not be any copies of the stored address in other caches (the variable is local to a task), in which case waiting for the acknowledgements is a waste of time. When other tasks do share the variable, waiting for the acknowledgements can be explicitly enforced by using atomic or fence instructions.
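A hedged illustration of why this matters to a programmer is the classic "store buffering" litmus test, sketched below in C++ (the iteration count and variable names are arbitrary; with relaxed ordering the reordering can also come from the compiler, but on x86 the store buffer alone is enough to produce it):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// "Store buffering" litmus test: each thread stores to one variable and then
// loads the other. Under TSO both stores can still be sitting in their cores'
// store buffers when the loads execute, so r1 == r2 == 0 is allowed, even
// though it is impossible under sequential consistency.
std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
void t2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

int main() {
    int reordered = 0;
    for (int i = 0; i < 200000; ++i) {
        x.store(0); y.store(0); r1 = r2 = -1;
        std::thread a(t1), b(t2);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0) ++reordered;   // forbidden by SC, permitted by TSO
    }
    std::printf("r1 == r2 == 0 observed in %d of 200000 runs\n", reordered);
}
```

Replacing memory_order_relaxed with memory_order_seq_cst, or inserting std::atomic_thread_fence(std::memory_order_seq_cst) between each store and the following load, forces the store buffer to drain before the load, and the r1 == r2 == 0 outcome disappears; this is exactly the "explicitly enforced by using atomic or fence instructions" case described above.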
How should I think of the two concepts of cache coherence as well as memory model?
Cache coherence protocols are concerned with serializing stores to the same memory location and ensuring that a load returns the value of the most recent store to the same memory location. Cache coherence protocols are required only when there are caches or multiple copies of the same memory location, and its job is to keep all the copies coherent.
A memory consistency model is concerned with the relative order of loads and stores (of the same task) to different memory locations. Any system that involves executing shared-memory parallel programs (multiple tasks or threads communicating through a shared memory) must define its memory consistency model.
Broadly, cache coherence protocols implement part of the memory consistency model. More precisely, it is the combination of the core pipeline and the cache coherence protocol (and every other component that a memory instruction traverses) that must adhere to the memory model specification.
[1]: Shared memory consistency models: a tutorial

Could multi-cpu access memory simultaneously in common home computer?

As far as I know, in a modern multi-core CPU system, the different CPUs share one memory bus. Does that mean only one CPU can access memory at any given moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory busses, it's normal for them to be simplex (i.e. either loading or storing, not sending data in both directions at once like gigabit ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
First of all, remember that when a CPU core "accesses the memory", it doesn't have to read directly from DRAM. The caches maintain a coherent view of memory shared by all cores, using (a variant of) the MESI cache coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
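A rough sketch of how one might observe that cost (thread pinning, warm-up, the Counter and time_* names, and the iteration count are all assumptions made for the example): two threads incrementing a single shared counter force the line to migrate back and forth in Modified state, while two threads incrementing private, cache-line-padded counters do not.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr long kIters = 10'000'000;

struct alignas(64) Counter { std::atomic<long> v{0}; };   // padded to one cache line

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Both threads hammer the same line: ownership (the Modified state) has to
// bounce between the two cores on every hand-over, paying core-to-core latency.
double time_shared() {
    Counter shared;
    auto work = [&] { for (long i = 0; i < kIters; ++i) shared.v.fetch_add(1, std::memory_order_relaxed); };
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(work), b(work);
    a.join(); b.join();
    return seconds_since(t0);
}

// Each thread owns its own line: after the first miss there is no coherence traffic.
double time_private() {
    Counter c[2];
    auto work = [&](int id) { for (long i = 0; i < kIters; ++i) c[id].v.fetch_add(1, std::memory_order_relaxed); };
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(work, 0), b(work, 1);
    a.join(); b.join();
    return seconds_since(t0);
}

int main() {
    std::printf("shared line:   %.3f s\n", time_shared());
    std::printf("private lines: %.3f s\n", time_private());
}
```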
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one to each DIMM.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
I know that modern CPUs use caches to achieve low latency. So my question is based on the scenario where the computer has just started: there is no data in the caches, so the CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
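A minimal sketch of that experiment (assuming a C++17 compiler; the 1 GiB array size and the 1/2/4/8 thread counts are arbitrary choices): sum a large array with a varying number of threads and compare the aggregate bandwidth.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t{1} << 27;          // ~1 GiB of uint64_t, far larger than any cache
    std::vector<std::uint64_t> data(n, 1);

    for (int threads = 1; threads <= 8; threads *= 2) {
        std::vector<std::uint64_t> partial(threads, 0);
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([&, t] {
                std::size_t lo = n / threads * t;
                std::size_t hi = (t == threads - 1) ? n : n / threads * (t + 1);
                // Each thread streams its own contiguous chunk of the array.
                partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, std::uint64_t{0});
            });
        for (auto& th : pool) th.join();

        double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        std::uint64_t sum = std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
        std::printf("%d thread(s): %.1f GB/s (checksum %llu)\n",
                    threads, n * sizeof(std::uint64_t) / secs / 1e9,
                    static_cast<unsigned long long>(sum));
    }
}
```

If the GB/s figure stops improving after one or two threads, a single core is already close to saturating the memory controllers, which is the point made above.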
Now I'll explain further.
It all depends on the architecture you're using; for now, let's say modern SMP and SDRAM:
1) If two cores tried to access the same address in RAM, it could go several ways:

If they both want to read, simultaneously:

- Two cores on the same chip will probably share an intermediate cache at some level (2 or 3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready.
- Two cores on different chips may not share a cache, but still need to co-ordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.

If they both want to write:

- Two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM.
- Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip 1, fetched into chip 2's cache, modified and then written back again (no idea whether the write/fetch can be coalesced by snooping).

2) If two cores tried to access different addresses: for a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.

How can an unlock/lock operation on a mutex be faster than a fetch from memory?

Norvig claims that a mutex lock or unlock operation takes only a quarter of the time needed to do a fetch from memory.
This answer explains that a mutex is essentially a flag and a wait queue, and that it only takes a few instructions to flip the flag on an uncontended mutex.
I assume that if a different CPU or core tries to lock that mutex, it needs to wait for the cache line to be written back to memory (if that hasn't already happened) and for its own memory read to get the state of the flag. Is that correct? What is the difference between a different core and a different CPU?
So the numbers Norvig states apply only to an uncontended mutex, where the CPU or core trying the operation already has that flag in its cache and the cache line isn't dirty?
A typical PC runs an x86 CPU, and Intel's CPUs can perform the locking entirely in the caches:
if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.”
The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
From Intel Software Developer Manual 3, Section 8.1.4
The cache coherence mechanism is a variation of the MESI protocol. In such a protocol, before a CPU can write to a cached location, it must have the corresponding line in the Exclusive (E) state. This means that only one CPU at a time holds a given memory location in a dirty state. When other CPUs want to read the same location, the owner CPU delays such reads until the atomic operation is finished; it then follows the coherence protocol to either forward, invalidate, or write back the line.
In the above scenario a lock can be performed faster than an uncached load.
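If you want to sanity-check Norvig's ratio yourself, a rough micro-benchmark sketch follows (the buffer size, iteration counts, and the use of std::mutex as a stand-in for "a mutex lock/unlock" are assumptions; a serious measurement would need warm-up, pinning, and a look at the generated code): it times an uncontended lock/unlock pair against a chain of dependent loads whose working set is far larger than the caches.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr int iters = 1'000'000;
    using clk = std::chrono::steady_clock;

    // 1) Uncontended lock/unlock: the mutex word stays in this core's cache in
    //    Modified state, so each round trip is just a couple of atomic operations.
    std::mutex m;
    auto t0 = clk::now();
    for (int i = 0; i < iters; ++i) { m.lock(); m.unlock(); }
    double ns_lock = std::chrono::duration<double, std::nano>(clk::now() - t0).count() / iters;

    // 2) Dependent loads over a random cyclic permutation much larger than the
    //    caches: nearly every access has to go out to DRAM.
    const std::size_t n = std::size_t{1} << 25;            // 32M entries * 8 B = 256 MiB
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {              // Sattolo's algorithm: one big cycle,
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);                  // so the chase never settles into a
    }                                                      // small, cache-resident loop
    std::size_t p = 0;
    t0 = clk::now();
    for (int i = 0; i < iters; ++i) p = next[p];           // each load depends on the previous one
    double ns_load = std::chrono::duration<double, std::nano>(clk::now() - t0).count() / iters;

    std::printf("uncontended lock+unlock: %.1f ns, cache-missing load: %.1f ns (p=%zu)\n",
                ns_lock, ns_load, p);
}
```

On typical hardware the uncontended lock/unlock pair comes out far cheaper than the DRAM-bound load, which is the relationship Norvig's numbers describe.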
The times Norvig quotes, however, are a bit off and surely outdated. They are intended to give an ordering, along with an order of magnitude, of the typical operations. The timing for an L1 hit is a bit odd: it isn't faster than typical instruction execution (which by itself cannot be described with a single number). The Intel optimization manual reports, for an older CPU like Sandy Bridge, an L1 access time of 4 cycles, while there are a lot of instructions with a latency of 4 cycles or less. I would take those numbers with a grain of salt and avoid reasoning too much on them.
The lesson Norvig tried to teach us is: hardware is layered, and the closer (from a topological point of view1) something is to the CPU, the faster it is. So when parsing a file, a programmer should avoid moving data back and forth to the file and should instead minimize the IO pressure. The same applies when processing an array: locality will improve performance. Note, however, that these are technically micro-optimisations and the topic is not as simple as it appears.
1 In general divide the hardware in what is: inside the core (registers), inside the CPU (caches, possibly not the LLC), inside the socket (GPU, LLC), behind dedicated bus devices (memory, other CPUs), behind one generic bus (PCIe - internal devices like network cards), behind two or more buses (USB devices, disks) and in another computer entirely (servers).

Cache or Registers - which is faster?

I'm sorry if this is the wrong place to ask this, but I've searched and always found different answers. My question is:
Which is faster? Cache or CPU Registers?
As I understand it, the registers are what the CPU loads data into in order to execute on it, while the cache is just a storage area close to or inside the CPU.
Here are the sources I found that confuse me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
A CPU register is always faster than the L1 cache; it is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower, because if you make something bigger then inevitably the distances get larger. For a while this happened automatically: shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back cache that has a 0 cycle latency. When you store a value to memory it falls into that cache, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back cache. Writeback caches have been prominent in desktop chip design since about 2001, but were widely missing from ARM-based mobile chips until much more recently.
x86 chips these days have store forwarding from the write-back cache. If you store to an address held in the WB cache and then read back from the same address several instructions later, the CPU will fetch the value from the WB cache instead of accessing L1 memory for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle. But in fact, the L1 isn't referenced at all in that case. Store forwarding also has some other rules for it to work properly, which also vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
The store forwarding feature can generate false positives wherein the CPU thinks the address is in the writeback buffer based on a fast partial-bits check (usually 10-14 bits, depending on the chip). It uses an extra cycle to verify with a full check. If that fails then the CPU must re-route the access as a regular memory request. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series.
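To put rough numbers on the register-versus-L1 comparison above, here is a sketch of a latency micro-benchmark (the iteration count, the 64-entry ring, and the assumption that the optimizer does not collapse either dependency chain are all mine; checking the generated assembly is advisable): one chain stays entirely in a register, the other needs an L1 load on every step.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>

// Two dependency chains of the same length: one lives entirely in a register,
// the other needs an L1 load on every step. The per-step times give a rough
// feel for ALU latency versus L1 load-to-use latency.
int main(int argc, char**) {
    constexpr long iters = 200'000'000;
    using clk = std::chrono::steady_clock;

    // (1) Register-to-register chain: one or two cycles per dependent step.
    unsigned long r = static_cast<unsigned long>(argc);   // value unknown to the optimizer
    auto t0 = clk::now();
    for (long i = 0; i < iters; ++i) r = r * 3 + 1;       // each step depends on the previous result
    double ns_reg = std::chrono::duration<double, std::nano>(clk::now() - t0).count() / iters;

    // (2) L1-resident pointer chase: load-to-use latency (typically 4-5 cycles) per step.
    std::size_t ring[64];                                 // 512 bytes, comfortably inside L1d
    for (std::size_t i = 0; i < 64; ++i) ring[i] = (i + 1) % 64;
    std::size_t p = static_cast<std::size_t>(argc - 1);
    t0 = clk::now();
    for (long i = 0; i < iters; ++i) p = ring[p];         // each load's address depends on the previous load
    double ns_l1 = std::chrono::duration<double, std::nano>(clk::now() - t0).count() / iters;

    std::printf("register chain: %.2f ns/step, L1 load chain: %.2f ns/step (r=%lu, p=%zu)\n",
                ns_reg, ns_l1, r, p);
}
```

The ratio of the two per-step times approximates the L1 load-to-use latency measured in simple-ALU-op units.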
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
And then manually cross-reference that with the tables of architectures, models, and release dates from the various List of CPUs pages on Wikipedia.

Resources