Given that CPUs are now multi-core and have their own L1/L2 caches, I was curious as to how the L3 cache is organized, given that it's shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages' worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages, so that when multiple threads run on the same core each thread may find its data in one of the sub-pages. I'm just coming up with this off the top of my head, so I'm very interested in educating myself on what is really going on under the hood. Can anyone share their insights or provide me with a link that will cure me of my ignorance?
Many thanks in advance.
There is a single (sliced) L3 cache in a single-socket chip, and several L2 caches (one per real physical core).
The L3 cache caches data in segments of 64 bytes (cache lines), and there is a special cache-coherence protocol between L3 and the different L2/L1 caches (and between the several chips in NUMA/ccNUMA multi-socket systems too); it tracks which copy of a cache line is current, which is shared between several caches, and which has just been modified (and should be invalidated in the other caches). Some of the protocols (possible cache-line states and state transitions): https://en.wikipedia.org/wiki/MESI_protocol, https://en.wikipedia.org/wiki/MESIF_protocol, https://en.wikipedia.org/wiki/MOESI_protocol
In older chips (the Core 2 era) cache coherence was maintained by snooping on a shared bus; now it is checked with the help of a directory.
In real life the L3 is not a single monolithic block but is sliced into several slices, each of them having high-speed access ports. There is some method of selecting the slice based on the physical address, which allows a multicore system to do many accesses at every moment (each access will be directed by an undocumented method to some slice; when two cores use the same physical address, their accesses will be served by the same slice, or by slices that will do the cache-coherence protocol checks).
Information about the L3 cache slices was reverse-engineered in several papers:
https://cmaurice.fr/pdf/raid15_maurice.pdf Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters
https://eprint.iacr.org/2015/690.pdf Systematic Reverse Engineering of Cache Slice Selection in Intel Processors
https://arxiv.org/pdf/1508.03767.pdf Cracking Intel Sandy Bridge’s Cache Hash Function
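To make the slice-selection idea concrete, here is a minimal sketch in C++. The masks and the XOR-folding are purely hypothetical (loosely modeled on the kind of hash the papers above recovered); the real functions are undocumented and differ between models.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: pick an L3 slice by XOR-folding selected physical-address
// bits. Real Intel hash functions are undocumented and model-specific; the
// masks below are made up for the example.
static unsigned l3_slice(uint64_t phys_addr, unsigned num_slices) {
    static const uint64_t masks[2] = {
        0x1B5F575440ULL,  // hypothetical bits XORed to form slice-select bit 0
        0x2EB5FAA880ULL,  // hypothetical bits XORed to form slice-select bit 1
    };
    unsigned slice = 0;
    for (unsigned b = 0; b < 2 && (1u << b) < num_slices; ++b)
        slice |= (unsigned)__builtin_parityll(phys_addr & masks[b]) << b;
    return slice % num_slices;
}

int main() {
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)   // walk 8 cache lines
        printf("line at 0x%llx -> slice %u\n",
               (unsigned long long)addr, l3_slice(addr, 4));
}
```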
With recent chips the programmer has the ability to partition the L3 cache between applications using "Cache Allocation Technology" (v4 family): https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models https://danluu.com/intel-cat/ https://lwn.net/Articles/659161/
Modern Intel L3 caches (since Nehalem) use a 64B line size, the same as L1/L2. They're shared, and inclusive.
See also http://www.realworldtech.com/nehalem/2/
Since SnB at least, each core has part of the L3, and they're on a ring bus. So in big Xeons, L3 size scales linearly with the number of cores.
See also Which cache mapping technique is used in intel core i7 processor? where I wrote a much larger and more complete answer.
This is about the cache coherency protocol across different levels of cache. My understanding (x86_64) is that L1 is owned exclusively by a core, L2 is shared between 2 cores, and L3 is shared by all the cores in a CPU socket. I have read about how the MESI protocol works, about store buffers, invalidate queues, invalidate messages, etc. My doubt is whether MESI applies to L1 only, or whether it applies to L2 and L3 as well. Or is there a different cache-synchronization mechanism for L2 and L3?
The number of cache levels, how each level is organized with respect to other processors or cores in the system, and the coherence protocol implemented in each cache is defined by the core microarchitecture, the uncore microarchitecture, and, in some cases, relevant boot-time configuration options. These design aspects vary by vendor and processor generation and models within the same generation. There are a lot of different designs even if you just consider the processors released in the past few years.
The organization of the cache hierarchy is always clearly documented by Intel and AMD. However, the coherence protocols are not always clearly documented. You won't find a section in any official document that directly tells you all the protocols that caches use. Some hardware performance event names allude to what coherence protocol is used in the cache to which the events apply.
The instruction cache (L1I) always uses the SI protocol because a line is never modified between the point of fill and the point of invalidation, so an entry can only be in the S or I state. The M and E states are only relevant when the cache supports modifying an existing line.
Some microarchitectures have caches that only support the write-through write hit policy. For example, the L1D in the AMD Bulldozer is a write-through cache. The M state doesn't make sense in a write-through cache. This means that the L1D either uses SI or ESI. SI is more likely because it requires only a single bit of state per entry.
Intel processors almost always support the write-back policy in all data and unified caches. Old Intel processors (from the 90s and early 2000s) with two levels of cache use MESI for the L1D and L2. Intel processors with three levels of cache also use MESI for the L1D and L2. The fact that four states are available doesn't necessarily mean that all of them are being used. A cache line whose physical address falls within a region with the write-through (WT) memory type doesn't use the M state. (It's possible that the type changed from WB to WT, so the first WT access could hit in M.) So the effective protocol for a WT line is ESI or SI.
The L3 caches in Intel processors starting with Nehalem-EX use the MESIF protocol with an inclusive directory (used on a hit) for the entire NUMA node. Nehalem-EX also employs an in-memory 2-state directory to track which lines are owned by the off-package IOH. The in-memory directory protocol changed in Westmere-EX, and then changed again in the Xeon E5, and again in the Xeon E5/E7 v2, and again in the Xeon E5/E7 v3. These processors also support multiple coherence protocols in the L3-miss scenario with different tradeoffs.
I'm not sure what else to say to answer your question. I guess you could say that MESI is more or less applicable to the L2 and L3.
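As a rough, simplified illustration of what "MESI is applicable" means in practice, here is a toy sketch of the per-line state bookkeeping. It is not any particular processor's implementation and omits snoop/directory messaging, writebacks, and the extra MESIF/MOESI states.

```cpp
#include <cstdio>

// Simplified MESI bookkeeping for one cache line in one cache.
enum class State { Modified, Exclusive, Shared, Invalid };

// Events as seen by this cache: local reads/writes, and remote cores'
// reads/writes observed via snooping or the directory.
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

State next_state(State s, Event e) {
    switch (e) {
    case Event::LocalRead:
        // A miss (Invalid) fills as Exclusive if no other cache holds the
        // line, otherwise Shared; assume Shared here for simplicity.
        return (s == State::Invalid) ? State::Shared : s;
    case Event::LocalWrite:
        return State::Modified;              // must gain ownership first (RFO)
    case Event::RemoteRead:
        // Another core reading our Modified/Exclusive copy downgrades it to
        // Shared (the modified data is supplied to the requester).
        return (s == State::Invalid) ? s : State::Shared;
    case Event::RemoteWrite:
        return State::Invalid;               // another core took ownership
    }
    return s;
}

int main() {
    State s = State::Invalid;
    const char* names[] = {"M", "E", "S", "I"};
    Event trace[] = {Event::LocalRead, Event::LocalWrite,
                     Event::RemoteRead, Event::RemoteWrite};
    for (Event e : trace) {
        s = next_state(s, e);
        printf("-> %s\n", names[(int)s]);
    }
}
```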
As far as I know, in a modern multi-core CPU system, different CPUs share one memory bus. Does that mean only one CPU can access the memory at any given moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory busses, it's normal for them to be half-duplex (i.e. either loading or storing, not sending data in both directions at once like gigabit Ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
First of all, remember that when a CPU core "accesses the memory", they don't have to directly read from DRAM. The cache maintains a coherent view of memory shared by all cores, using (a variant of) the MESI cache coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
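A quick way to see that cost (a rough sketch, not a rigorous benchmark): two threads bouncing a counter through a single shared cache line. Each handoff forces the line out of the writer's Modified state and over to the other core, so the time per handoff is dominated by inter-core cache-line transfer, not DRAM.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two threads ping-pong a counter through one shared cache line.
// alignas(64) keeps the variable on its own 64B line.
alignas(64) std::atomic<long> ball{0};

int main() {
    const long iters = 1'000'000;
    auto t0 = std::chrono::steady_clock::now();

    std::thread pinger([&] {
        for (long i = 0; i < iters; ++i) {
            while (ball.load(std::memory_order_acquire) % 2 != 0) {}   // wait for even
            ball.store(ball.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);                     // make it odd
        }
    });
    std::thread ponger([&] {
        for (long i = 0; i < iters; ++i) {
            while (ball.load(std::memory_order_acquire) % 2 != 1) {}   // wait for odd
            ball.store(ball.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);                     // make it even
        }
    });
    pinger.join();
    ponger.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    printf("%.1f ns per handoff\n", (double)ns / (2.0 * iters));
}
```

Pinning the two threads to different physical cores (or to hyper-siblings, for comparison) with taskset or pthread affinity makes the difference described in the linked questions visible.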
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one to each DIMM.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
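For illustration only, here is a sketch of an address-to-channel mapping, assuming a simple cache-line-granularity interleave on one address bit; real memory controllers typically hash several address bits, and the exact mapping depends on the platform and BIOS settings.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical dual-channel mapping: interleave at cache-line (64B) granularity
// on physical-address bit 6. Real controllers usually hash several bits.
static unsigned channel_of(uint64_t phys_addr) {
    return (phys_addr >> 6) & 1;   // channel 0 or 1
}

int main() {
    // Consecutive cache lines alternate channels, so a streaming access
    // pattern naturally keeps both channels busy.
    for (uint64_t a = 0; a < 6 * 64; a += 64)
        printf("line 0x%03llx -> channel %u\n",
               (unsigned long long)a, channel_of(a));
}
```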
I know that modern CPUs use caches to achieve low latency. So my question is based on the scenario where the computer has just started: there is no data in the caches, so the CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
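A minimal sketch of that experiment (not a calibrated benchmark like STREAM; the buffer size and thread counts are arbitrary choices): sum a buffer much larger than the last-level cache with 1, 2, 4, ... threads and watch where the aggregate GB/s stops scaling.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Sum a buffer much larger than the last-level cache and report GB/s.
// Where the numbers stop scaling with thread count is (roughly) the
// memory-bandwidth limit of the machine.
static double sweep_gbps(const std::vector<double>& buf, int nthreads) {
    std::vector<std::thread> workers;
    std::vector<double> partial(nthreads, 0.0);
    size_t chunk = buf.size() / nthreads;
    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const double* begin = buf.data() + t * chunk;
            partial[t] = std::accumulate(begin, begin + chunk, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    double secs = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();
    volatile double sink = std::accumulate(partial.begin(), partial.end(), 0.0);
    (void)sink;   // keep the sums from being optimized away
    return buf.size() * sizeof(double) / secs / 1e9;
}

int main() {
    std::vector<double> buf(1u << 27, 1.0);        // 1 GiB of doubles
    for (int n : {1, 2, 4, 8})
        printf("%d thread(s): %.1f GB/s\n", n, sweep_gbps(buf, n));
}
```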
Now I'll explain further.
It all depends on the architecture you're using. For now, let's assume a modern SMP system with SDRAM:
1) If two cores try to access the same address in RAM, it could go several ways:

If they both want to read simultaneously: two cores on the same chip will probably share an intermediate cache at some level (L2 or L3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready. Two cores on different chips may not share a cache, but still need to coordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.

If they both want to write: two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM. Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip 1, fetched into chip 2's cache, modified and then written back again (no idea whether the write/fetch can be coalesced by snooping).

2) If two cores try to access different addresses: for a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.
I'm sorry if this is the wrong place to ask this, but I've searched and always found different answers. My question is:
Which is faster? Cache or CPU Registers?
As I understand it, the registers directly hold the data the CPU executes on, while the cache is just a storage place close to or inside the CPU.
Here are the sources I found that confuses me:
2 for cache | 1 for registers
http://in.answers.yahoo.com/question/index?qid=20110503030537AAzmDGp
Cache is faster.
http://wiki.answers.com/Q/Is_cache_memory_faster_than_CPU_registers
So which really is it?
A CPU register is always faster than the L1 cache. It is the closest. The difference is roughly a factor of 3.
Trying to make this as intuitive as possible without getting lost in the physics underlying the question: there is a simple correlation between speed and distance in electronics. The further you make a signal travel, the harder it gets to get that signal to the other end of the wire without the signal getting corrupted. It is the "there is no free lunch" principle of electronic design.
The corollary is that bigger is slower, because if you make something bigger then inevitably the distances are going to get larger. Something that was automatic for a while: shrinking the feature size on the chip automatically produced a faster processor.
The register file in a processor is small and sits physically close to the execution engine. The furthest removed from the processor is the RAM. You can pop the case and actually see the wires between the two. In between sit the caches, designed to bridge the dramatic gap between the speed of those two opposites. Every processor has an L1 cache, relatively small (32 KB typically) and located closest to the core. Further down is the L2 cache, relatively big (4 MB typically) and located further from the core. More expensive processors also have an L3 cache, bigger and further away.
Specifically on x86 architecture:
Reading from register has 0 or 1 cycle latency.
Writing to registers has 0 cycle latency.
Reading/Writing L1 cache has a 3 to 5 cycle latency (varies by architecture age)
Actual load/store requests may execute within 0 or 1 cycles due to write-back buffer and store-forwarding features (details below)
Reading from register can have a 1 cycle latency on Intel Core 2 CPUs (and earlier models) due to its design: If enough simultaneously-executing instructions are reading from different registers, the CPU's register bank will be unable to service all the requests in a single cycle. This design limitation isn't present in any x86 chip that's been put on the consumer market since 2010 (but it is present in some 2010/11-released Xeon chips).
L1 cache latencies are fixed per-model but tend to get slower as you go back in time to older models. However, keep in mind three things:
x86 chips these days have a write-back buffer with 0-cycle latency. When you store a value to memory it falls into that buffer, and the instruction is able to retire in a single cycle. Memory latency then only becomes visible if you issue enough consecutive writes to fill the write-back buffer. Write-back buffers have been prominent in desktop chip design since about 2001, but were widely missing from ARM-based mobile chips until much more recently.
x86 chips these days have store forwarding from the write-back buffer. If you store to an address and then read back the same address several instructions later, the CPU will fetch the value from the write-back buffer instead of accessing L1 for it. This reduces the visible latency on what appears to be an L1 request to 1 cycle. In fact, the L1 isn't referenced at all in that case. Store forwarding also has some other rules for it to work properly, which vary a lot across the various CPUs available on the market today (typically requiring 128-bit address alignment and matched operand size).
The store forwarding feature can generate false positives, wherein the CPU thinks the address is in the write-back buffer based on a fast partial-bits check (usually 10-14 bits, depending on the chip). It uses an extra cycle to verify with a full check. If that fails then the CPU must re-route the access as a regular memory request. This miss can add an extra 1-2 cycles of latency to qualifying L1 cache accesses. In my measurements, store forwarding misses happen quite often on AMD's Bulldozer, for example; enough so that its L1 cache latency over time is about 10-15% higher than its documented 3 cycles. It is almost a non-factor on Intel's Core series.
Primary reference: http://www.agner.org/optimize/ and specifically http://www.agner.org/optimize/microarchitecture.pdf
And then manually graph info from that with the tables on architectures, models, and release dates from the various List of CPUs pages on wikipedia.
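If you want to observe the L1 load-use latency yourself, the usual trick is pointer chasing: each load's address depends on the previous load, so the loads can't overlap and the time per iteration approaches the load-use latency. A rough sketch (buffer size kept well under 32 KB so the chain stays in L1d; iteration count is arbitrary):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Pointer chasing: each load depends on the previous one, so iterations can't
// overlap. With the whole chain resident in L1d, ns/iteration multiplied by
// the clock frequency is roughly the L1 load-use latency in cycles.
int main() {
    const size_t n = 2048;                // 2048 * 8 bytes = 16 KiB, fits in L1d
    std::vector<size_t> next(n);
    for (size_t i = 0; i < n; ++i) next[i] = (i + 1) % n;   // simple ring

    volatile size_t seed = 0;
    size_t cur = seed;
    const long iters = 50'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        cur = next[cur];                  // serialized chain of loads
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    seed = cur;                           // keep the chain from being optimized out
    printf("%.2f ns per dependent load\n", (double)ns / iters);
}
```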
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
L1 is very tightly coupled to the CPU core, and is accessed on every memory access (very frequent). Thus, it needs to return the data really fast (usually within one clock cycle). Latency and throughput (bandwidth) are both performance-critical for L1 data cache. (e.g. four cycle latency, and supporting two reads and one write by the CPU core every clock cycle). It needs lots of read/write ports to support this high access bandwidth. Building a large cache with these properties is impossible. Thus, designers keep it small, e.g. 32KB in most processors today.
L2 is accessed only on L1 misses, so accesses are less frequent (usually 1/20th of the L1). Thus, L2 can have higher latency (e.g. from 10 to 20 cycles) and have fewer ports. This allows designers to make it bigger.
L1 and L2 play very different roles. If L1 is made bigger, it will increase L1 access latency which will drastically reduce performance because it will make all dependent loads slower and harder for out-of-order execution to hide. L1 size is barely debatable.
If we removed L2, L1 misses would have to go to the next level, say memory. This means that a lot of accesses would be going to memory, which would imply we need more memory bandwidth, which is already a bottleneck. Thus, keeping the L2 around is favorable.
Experts often refer to L1 as a latency filter (as it makes the common case of L1 hits faster) and L2 as a bandwidth filter as it reduces memory bandwidth usage.
Note: I have assumed a 2-level cache hierarchy in my argument to make it simpler. In many of today's multicore chips, there's an L3 cache shared between all the cores, while each core has its own private L1 and maybe L2. In these chips, the shared last-level cache (L3) plays the role of memory bandwidth filter. L2 plays the role of on-chip bandwidth filter, i.e. it reduces access to the on-chip interconnect and the L3. This allows designers to use a lower-bandwidth interconnect like a ring, and a slow single-port L3, which allows them to make L3 bigger.
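The "latency filter / bandwidth filter" point is easy to see with the standard average memory access time (AMAT) formula. A small sketch with made-up but plausible numbers (they are not measurements of any real chip):

```cpp
#include <cstdio>

// Average memory access time: AMAT = hit_time + miss_rate * miss_penalty,
// applied recursively per level. All numbers below are illustrative guesses.
int main() {
    double l1 = 4, l2 = 12, l3 = 40, dram = 200;   // latencies in cycles
    double m1 = 0.05, m2 = 0.30, m3 = 0.40;        // local miss rates

    // Three-level hierarchy: most accesses pay only the 4-cycle L1 hit.
    double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * dram));
    printf("AMAT with L1+L2+L3: %.1f cycles\n", amat);   // ~6.4 cycles

    // Drop the L2 (crudely: L1 misses go straight to L3 with the same
    // downstream miss rate). AMAT gets noticeably worse, and, just as
    // important, every L1 miss now hits the shared L3/interconnect,
    // i.e. far more traffic for them to absorb.
    double amat_no_l2 = l1 + m1 * (l3 + m3 * dram);
    printf("AMAT without L2:    %.1f cycles\n", amat_no_l2);   // ~10 cycles
}
```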
Perhaps worth mentioning that the number of ports is a very important design point because it affects how much chip area the cache consumes. Ports add wires to the cache which consumes a lot of chip area and power.
There are different reasons for that.
L2 exists in the system to speed up the case where there is an L1 cache miss. If the size of L1 were the same as or bigger than the size of L2, then L2 could not accommodate more cache lines than L1, and would not be able to deal with L1 cache misses. From the design/cost perspective, the L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to the slower hardware by adding intermediate hardware that is more performant (and expensive) than the slowest hardware and yet cheaper than the fastest hardware you have. Even if you decided to double the L1 cache, you would also increase L2, to speed up L1-cache misses.
So why is there L2 cache at all? Well, L1 cache is usually more performant and more expensive to build, and it is bound to a single core. This means that increasing the L1 size by a fixed quantity will have that cost multiplied by 4 in a dual-core processor, or by 8 in a quad-core. L2 is usually shared by different cores (depending on the architecture it can be shared across a couple of cores or all cores in the processor), so the cost of increasing L2 would be smaller even if the price of L1 and L2 were the same, which it is not.
@Aater's answer explains some of the basics. I'll add some more details plus examples of the real cache organization on Intel Haswell and AMD Piledriver, with latencies and other properties, not just size.
For some details on IvyBridge, see my answer on "How can cache be that fast?", with some discussion of the overall load-use latency including address-calculation time, and widths of the data busses between different levels of cache.
L1 needs to be very fast (latency and throughput), even if that means a limited hit-rate. L1d also needs to support single-byte stores on almost all architectures, and (in some designs) unaligned accesses. This makes it hard to use ECC (error correction codes) to protect the data, and in fact some L1d designs (Intel) just use parity, with better ECC only in outer levels of cache (L2/L3) where the ECC can be done on larger chunks for lower overhead.
It's impossible to design a single level of cache that could provide the low average request latency (averaged over all hits and misses) of a modern multi-level cache. Since modern systems have multiple very hungry cores all sharing a connection to the same relatively-high latency DRAM, this is essential.
Every core needs its own private L1 for speed, but at least the last level of cache is typically shared, so a multi-threaded program that reads the same data from multiple threads doesn't have to go to DRAM for it on each core. (And to act as a backstop for data written by one core and read by another). This requires at least two levels of cache for a sane multi-core system, and is part of the motivation for more than 2 levels in current designs. Modern multi-core x86 CPUs have a fast 2-level cache in each core, and a larger slower cache shared by all cores.
L1 hit-rate is still very important, so L1 caches are not as small / simple / fast as they could be, because that would reduce hit rates. Achieving the same overall performance would thus require higher levels of cache to be faster. If higher levels handle more traffic, their latency is a bigger component of the average latency, and they bottleneck on their throughput more often (or need higher throughput).
High throughput often means being able to handle multiple reads and writes every cycle, i.e. multiple ports. This takes more area and power for the same capacity as a lower-throughput cache, so that's another reason for L1 to stay small.
L1 also uses speed tricks that wouldn't work if it were larger. i.e. most designs use Virtually-Indexed, Physically Tagged (VIPT) L1, but with all the index bits coming from below the page offset so they behave like PIPT (because the low bits of a virtual address are the same as in the physical address). This avoids synonyms / homonyms (false hits, or the same data being in the cache in two places; see Paul Clayton's detailed answer on the linked question), but still lets part of the hit/miss check happen in parallel with the TLB lookup. A VIVT cache doesn't have to wait for the TLB, but it has to be invalidated on every change to the page tables.
On x86 (which uses 4kiB virtual memory pages), 32kiB 8-way associative L1 caches are common in modern designs. The 8 tags can be fetched based on the low 12 bits of the virtual address, because those bits are the same in virtual and physical addresses (they're below the page offset for 4kiB pages). This speed-hack for L1 caches only works if they're small enough and associative enough that the index doesn't depend on the TLB result. 32kiB / 64B lines / 8-way associativity = 64 (2^6) sets. So the lowest 6 bits of an address select bytes within a line, and the next 6 bits index a set of 8 tags. This set of 8 tags is fetched in parallel with the TLB lookup, so the tags can be checked in parallel against the physical-page selection bits of the TLB result to determine which (if any) of the 8 ways of the cache hold the data. (Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical)
Making a larger L1 cache would mean it had to either wait for the TLB result before it could even start fetching tags and loading them into the parallel comparators, or it would have to increase in associativity to keep log2(sets) + log2(line_size) <= 12. (More associativity means more ways per set => fewer total sets = fewer index bits). So e.g. a 64kiB cache would need to be 16-way associative: still 64 sets, but each set has twice as many ways. This makes increasing L1 size beyond the current size prohibitively expensive in terms of power, and probably even latency.
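To make that arithmetic concrete, here is a tiny sketch of the index math and the VIPT constraint for the 32 KiB / 8-way / 64 B-line / 4 KiB-page case described above (the example address is arbitrary):

```cpp
#include <cstdint>
#include <cstdio>

// Index arithmetic for a 32 KiB, 8-way, 64 B-line cache with 4 KiB pages.
int main() {
    const unsigned size = 32 * 1024, ways = 8, line = 64, page = 4096;
    const unsigned sets = size / (ways * line);          // 64 sets
    const unsigned offset_bits = __builtin_ctz(line);    // 6
    const unsigned index_bits  = __builtin_ctz(sets);    // 6

    printf("sets = %u, offset bits = %u, index bits = %u\n",
           sets, offset_bits, index_bits);

    // VIPT-behaves-like-PIPT constraint: index + offset bits must fit inside
    // the page offset, so they're identical in virtual and physical addresses.
    printf("index+offset = %u bits, page offset = %u bits -> %s\n",
           offset_bits + index_bits, __builtin_ctz(page),
           (offset_bits + index_bits <= __builtin_ctz(page)) ? "OK"
                                                             : "needs more ways");

    // Example: which set does a given virtual address map to?
    uint64_t vaddr = 0x7ffd1234'5678ULL;                 // arbitrary example
    unsigned set = (vaddr >> offset_bits) & (sets - 1);
    printf("address 0x%llx -> set %u\n", (unsigned long long)vaddr, set);
}
```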
Spending more of your power budget on L1D cache logic would leave less power available for out-of-order execution, decoding, and of course L2 cache and so on. Getting the whole core to run at 4GHz and sustain ~4 instructions per clock (on high-ILP code) without melting requires a balanced design. See this article: Modern Microprocessors: A 90-Minute Guide!.
The larger a cache is, the more you lose by flushing it, so a large VIVT L1 cache would be worse than the current VIPT-that-works-like-PIPT. And a larger but higher-latency L1D would probably also be worse.
According to @PaulClayton, L1 caches often fetch all the data in a set in parallel with the tags, so it's there ready to be selected once the right tag is detected. The power cost of doing this scales with associativity, so a large highly-associative L1 would be really bad for power use as well as die area (and latency). (Compared to L2 and L3, it wouldn't be a lot of area, but physical proximity is important for latency. Speed-of-light propagation delays matter when clock cycles are 1/4 of a nanosecond.)
Slower caches (like L3) can run at a lower voltage / clock speed to make less heat. They can even use different arrangements of transistors for each storage cell, to make memory that's more optimized for power than for high speed.
There are a lot of power-use related reasons for multi-level caches. Power / heat is one of the most important constraints in modern CPU design, because cooling a tiny chip is hard. Everything is a tradeoff between speed and power (and/or die area). Also, many CPUs are powered by batteries or are in data-centres that need extra cooling.
L1 is almost always split into separate instruction and data caches. Instead of an extra read port in a unified L1 to support code-fetch, we can have a separate L1I cache tied to a separate I-TLB. (Modern CPUs often have an L2-TLB, which is a second level of cache for translations that's shared by the L1 I-TLB and D-TLB, NOT a TLB used by the regular L2 cache). This gives us 64kiB total of L1 cache, statically partitioned into code and data caches, for much cheaper (and probably lower latency) than a monster 64k L1 unified cache with the same total throughput. Since there is usually very little overlap between code and data, this is a big win.
L1I can be placed physically close to the code-fetch logic, while L1D can be physically close to the load/store units. Speed-of-light transmission-line delays are a big deal when a clock cycle lasts only 1/3rd of a nanosecond. Routing the wiring is also a big deal: e.g. Intel Broadwell has 13 layers of copper above the silicon.
Split L1 helps a lot with speed, but unified L2 is the best choice.
Some workloads have very small code but touch lots of data. It makes sense for higher-level caches to be unified to adapt to different workloads, instead of statically partitioning into code vs. data. (e.g. almost all of L2 will be caching data, not code, while running a big matrix multiply, vs. having a lot of code hot while running a bloated C++ program, or even an efficient implementation of a complicated algorithm (e.g. running gcc)). Code can be copied around as data, not always just loaded from disk into memory with DMA.
Caches also need logic to track outstanding misses (since out-of-order execution means that new requests can keep being generated before the first miss is resolved). Having many misses outstanding means you overlap the latency of the misses, achieving higher throughput. Duplicating the logic and/or statically partitioning between code and data in L2 would not be good.
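Conceptually, that miss-tracking logic is a small table of miss-status holding registers (MSHRs; Intel's L1 "fill buffers"). A toy software sketch of the bookkeeping, purely for illustration (real hardware does these lookups in parallel with CAMs):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of miss-status holding registers: each entry tracks one
// outstanding cache-line miss plus the requests waiting on it, so later
// misses to the *same* line merge instead of issuing duplicate requests,
// and misses to different lines proceed in parallel.
struct Mshr {
    uint64_t line_addr;                 // which cache line is in flight
    std::vector<int> waiting_requests;  // load/store IDs to wake on fill
};

class MshrFile {
    std::vector<std::optional<Mshr>> slots_;
public:
    explicit MshrFile(size_t n) : slots_(n) {}   // e.g. 10 fill buffers (Haswell L1d)

    // Returns false if all slots are busy (the core must stall this miss).
    bool handle_miss(uint64_t addr, int request_id) {
        uint64_t line = addr & ~uint64_t{63};
        for (auto& s : slots_)                      // merge with an existing miss
            if (s && s->line_addr == line) {
                s->waiting_requests.push_back(request_id);
                return true;
            }
        for (auto& s : slots_)                      // allocate a new entry
            if (!s) { s = Mshr{line, {request_id}}; return true; }
        return false;                               // out of MSHRs: stall
    }

    void fill_arrived(uint64_t line) {              // data came back from L2/L3/DRAM
        for (auto& s : slots_)
            if (s && s->line_addr == line) { /* wake waiting_requests */ s.reset(); }
    }
};

int main() {
    MshrFile mshrs(10);
    mshrs.handle_miss(0x1000, 1);   // miss to line 0x1000
    mshrs.handle_miss(0x1008, 2);   // same line: merged, no new memory request
    mshrs.handle_miss(0x2000, 3);   // different line: second miss in flight
    mshrs.fill_arrived(0x1000);
}
```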
Larger lower-traffic caches are also a good place to put pre-fetching logic. Hardware pre-fetching enables good performance for things like looping over an array without every piece of code needing software-prefetch instructions. (SW prefetch was important for a while, but HW prefetchers are smarter than they used to be, so that advice in Ulrich Drepper's otherwise excellent What Every Programmer Should Know About Memory is out-of-date for many use cases.)
Low-traffic higher level caches can afford the latency to do clever things like use an adaptive replacement policy instead of the usual LRU. Intel IvyBridge and later CPUs do this, to resist access patterns that get no cache hits for a working set just slightly too large to fit in cache. (e.g. looping over some data in the same direction twice means it probably gets evicted just before it would be reused.)
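The access pattern being described looks like this (a sketch; the buffer is deliberately a bit larger than the cache level you are targeting, so a plain LRU policy would evict every line just before its reuse):

```cpp
#include <cstdio>
#include <vector>

// Loop over a working set slightly larger than the target cache, twice, in the
// same direction. With strict LRU each line is evicted just before it's
// reused, so the second pass misses everywhere; adaptive policies keep part of
// the working set resident instead. Sizes here are illustrative.
int main() {
    const size_t cache_bytes = 256 * 1024;                 // pretend L2 = 256 KiB
    std::vector<char> buf(cache_bytes + 64 * 1024, 1);     // a bit too big to fit

    long sum = 0;
    for (int pass = 0; pass < 2; ++pass)
        for (size_t i = 0; i < buf.size(); i += 64)        // one touch per line
            sum += buf[i];
    printf("%ld\n", sum);                                   // keep the loads live
}
```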
A real example: Intel Haswell. Sources: David Kanter's microarchitecture analysis and Agner Fog's testing results (microarch pdf). See also Intel's optimization manuals (links in the x86 tag wiki).
Also, I wrote up a separate answer on: Which cache mapping technique is used in intel core i7 processor?
Modern Intel designs use a large inclusive L3 cache shared by all cores as a backstop for cache-coherence traffic. It's physically distributed between the cores, with 2048 sets * 16-way (2MiB) per core (with an adaptive replacement policy in IvyBridge and later).
The lower levels of cache are per-core.
L1: per-core 32kiB each instruction and data (split), 8-way associative. Latency = 4 cycles. At least 2 read ports + 1 write port. (Maybe even more ports to handle traffic between L1 and L2, or maybe receiving a cache line from L2 conflicts with retiring a store.) Can track 10 outstanding cache misses (10 fill buffers).
L2: unified per-core 256kiB, 8-way associative. Latency = 11 or 12 cycles. Read bandwidth: 64 bytes / cycle. The main prefetching logic prefetches into L2. Can track 16 outstanding misses. Can supply 64B per cycle to the L1I or L1D. Actual port counts unknown.
L3: unified, shared (by all cores) 8MiB (for a quad-core i7). Inclusive (of all the L2 and L1 per-core caches). 12 or 16 way associative. Latency = 34 cycles. Acts as a backstop for cache-coherency, so modified shared data doesn't have to go out to main memory and back.
Another real example: AMD Piledriver: (e.g. Opteron and desktop FX CPUs.) Cache-line size is still 64B, like Intel and AMD have used for several years now. Text mostly copied from Agner Fog's microarch pdf, with additional info from some slides I found, and more details on the write-through L1 + 4k write-combining cache on Agner's blog, with a comment that only L1 is WT, not L2.
L1I: 64 kB, 2-way, shared between a pair of cores (AMD's version of SMT has more static partitioning than Hyperthreading, and they call each one a core. Each pair shares a vector / FPU unit, and other pipeline resources.)
L1D: 16 kB, 4-way, per core. Latency = 3-4 c. (Notice that all 12 bits below the page offset are still used for index, so the usual VIPT trick works.) (throughput: two operations per clock, up to one of them being a store). Policy = Write-Through, with a 4k write-combining cache.
L2: 2 MB, 16-way, shared between two cores. Latency = 20 clocks. Read throughput 1 per 4 clocks. Write throughput 1 per 12 clocks.
L3: 0 - 8 MB, 64-way, shared between all cores. Latency = 87 clocks. Read throughput 1 per 15 clocks. Write throughput 1 per 21 clocks.
Agner Fog reports that with both cores of a pair active, L1 throughput is lower than when the other half of a pair is idle. It's not known what's going on, since the L1 caches are supposed to be separate for each core.
The other answers here give specific and technical reasons why L1 and L2 are sized as they are, and while many of them are motivating considerations for particular architectures, they aren't really necessary: the underlying architectural pressure leading to increasing (private) cache sizes as you move away from the core is fairly universal and is the same as the reasoning for multiple caches in the first place.
The three basic facts are:
The memory accesses for most applications exhibit a high degree of temporal locality, with a non-uniform distribution.
Across a large variety of processes and designs, cache size and cache speed (latency and throughput) can be traded off against each other¹.
Each distinct level of cache involves incremental design and performance cost.
So at a basic level, you might be able to, say, double the size of the cache, but incur a latency penalty of 1.4x compared to the smaller cache.
So it becomes an optimization problem: how many caches should you have and how large should they be? If memory access were totally uniform within the working set size, you'd probably end up with a single fairly large cache, or no cache at all. However, access is strongly non-uniform, so a small-and-fast cache can capture a large number of accesses, disproportionate to its size.
If fact 2 didn't exist, you'd just create a very big, very fast L1 cache within the other constraints of your chip and not need any other cache levels.
If fact 3 didn't exist, you'd end up with a huge number of fine-grained "caches", faster and smaller at the center, and slower and larger outside, or perhaps a single cache with variable access times: faster for the parts closest to the core. In practice, fact 3 means that each level of cache has an additional cost, so you usually end up with a few quantized levels of cache².
Other Constraints
This gives a basic framework to understand cache count and cache sizing decisions, but there are secondary factors at work as well. For example, Intel x86 has 4K page sizes and their L1 caches use a VIPT architecture. VIPT means that the size of the cache divided by the number of ways cannot be larger³ than 4 KiB. So an 8-way L1 cache as used on the last half-dozen Intel designs can be at most 4 KiB * 8 = 32 KiB. It is probably no coincidence that that's exactly the size of the L1 cache on those designs! If it weren't for this constraint, it is entirely possible you'd have seen lower-associativity and/or larger L1 caches (e.g., 64 KiB, 4-way).
¹ Of course, there are other factors involved in the tradeoff as well, such as area and power, but holding those factors constant the size-speed tradeoff applies, and even if not held constant the basic behavior is the same.
² In addition to this pressure, there is a scheduling benefit to known-latency caches, like most L1 designs: an out-of-order scheduler can optimistically submit operations that depend on a memory load on the cycle the L1 cache would return data, reading the result off the bypass network. This reduces contention and perhaps shaves a cycle of latency off the critical path. This puts some pressure on the innermost cache level to have uniform/predictable latency, and probably results in fewer cache levels.
³ In principle, you can use VIPT caches without this restriction, but only by requiring OS support (e.g., page coloring) or with other constraints. The x86 arch hasn't done that and probably can't start now.
For those interested in this type of questions, my university recommends Computer Architecture: A Quantitative Approach and Computer Organization and Design: The Hardware/Software Interface. Of course, if you don't have time for this, a quick overview is available on Wikipedia.
I think the main reason for this is that L1 cache is faster and therefore more expensive.
https://en.wikichip.org/wiki/amd/microarchitectures/zen#Die
Compare the physical sizes of the L1, L2, and L3 caches for an AMD Zen core, for example. The density increases dramatically with the cache level.
Logically, the question answers itself.
If L1 were bigger than L2 (combined), then there would be no need for an L2 cache.
Why would you store your stuff on a tape drive if you can store all of it on an HDD?
I have a few questions regarding the cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, it has many repercussions when one writes software for multicore processors/multiprocessor systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo etc.) does each CPU core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, then I believe there might be fewer cache misses: if a particular processor's cache does not have some data but a second processor's cache has it, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?
Will there be any problems in allowing any processor to access another processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo etc.) does each CPU core/processor have its own cache memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a unified L2 shared between all cores. Or on 65nm Core2Quad (which was two Core 2 Duo dies in one package), each pair of cores had their own last-level cache and the pairs couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
The last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically 1.5 to 2.25 MB of L3 cache per core, so a many-core Xeon might have a 36 MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to the shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)
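If you want to check the private-vs-shared layout on your own machine, Linux exposes it in sysfs. A quick sketch (the /sys/devices/system/cpu/cpu0/cache/indexN nodes are standard; the output of course varies by CPU):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Print cache level, type, size, and which CPUs share each cache, for cpu0.
// On a typical Intel part you'll see private L1d/L1i/L2 (shared_cpu_list only
// lists cpu0 and its hyperthread sibling) and an L3 shared by all cores.
static std::string read_line(const std::string& path) {
    std::ifstream f(path);
    std::string s;
    std::getline(f, s);
    return s;
}

int main() {
    for (int idx = 0; ; ++idx) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index"
                           + std::to_string(idx) + "/";
        std::string level = read_line(base + "level");
        if (level.empty()) break;                    // no more cache nodes
        std::cout << "L" << level
                  << " " << read_line(base + "type")
                  << " " << read_line(base + "size")
                  << "  shared with CPUs: " << read_line(base + "shared_cpu_list")
                  << "\n";
    }
}
```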
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, then I believe there might be fewer cache misses: if a particular processor's cache does not have some data but a second processor's cache has it, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?
No. Each CPU core's L1 caches tightly integrate into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be fast enough for very hot data, but can't be big enough for less-frequently used data that's still accessed regularly. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access another processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
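A minimal sketch of that scenario in C++ (the variable names are made up): the writer ends up with the data and flag lines in Modified state, the coherence protocol invalidates the reader's stale copies, and the release/acquire pair on the flag guarantees the data store is visible before the flag is.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One core writes data and then publishes it with a flag; another core waits
// for the flag and reads the data. MESI/MESIF invalidates the reader's stale
// copies so it must fetch the writer's Modified lines; the release/acquire
// pair orders the stores/loads so the reader never sees the flag without the data.
int data = 0;
std::atomic<bool> ready{false};

int main() {
    std::thread writer([] {
        data = 42;                                     // plain store
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread reader([] {
        while (!ready.load(std::memory_order_acquire)) {}   // wait for the flag
        assert(data == 42);                            // guaranteed to see the update
    });
    writer.join();
    reader.join();
}
```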
Quick answers
1) Yes. 2) No, but it may all depend on which memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.
For a full-length explanation of the issue you should read the 9-part article "What Every Programmer Should Know About Memory" by Ulrich Drepper (http://lwn.net/Articles/250967/); you will get the full picture of the issues you seem to be inquiring about in good and accessible detail.
To answer your first question: I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache, and they share a second-level cache. This helps with both data synchronization and memory utilization.
To answer your second question, I believe your assumption to be correct. If the processors were able to access each other's caches, there would obviously be fewer cache misses as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having a shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question, there could potentially be a problem with accessing other processors' cache memory, which goes to the "Single Write Multiple Read" principle. We can't allow more than one process to write to the same location in memory at the same time.
For more info on the core 2 duo, read this neat article.
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/