Is there such thing as a semi-shared cache? - caching

I'm doing a little research on the cache hierarchy and have come across the concept of shared and private caches. I can see examples where caches are either private to a specific core (at higher levels) or shared amongst all of the cores.
Are there any such examples of a cache being shared across a certain subset of cores at an intermediary hierarchy level and if not, why? My impression is that this would act as a middle-ground in the trade-off between latency and hit rate, although I'm unable to find an example of such a structure.

Sharing an intermediate level cache among multiple cores — but fewer than share the last level cache — is not a common design point. There are, however, a few designs that share L2 cache with as many cores as L3 cache is shared.
POWER4 and POWER5 both shared L2 cache among two cores, with L3 also shared among two cores. Since L3 cache data was stored off-chip (tags were on-chip) and each chip only had two cores, this is more similar to just sharing last level cache. Total L2 capacity was strongly constrained by chip size and L3 (having off-chip data) had somewhat high latency, so sharing to increase effective capacity was more attractive than for more recent designs with on-chip L3.
SPARC M7 is a more interesting example. M7 had a 256 KiB L2 data cache shared among two cores and an L2 instruction cache shared among four cores with L3 shared among four cores (the documentation I have seen is not entirely clear that L3 is not unified, but the evidence generally points to L3 being private to each cluster of four cores). Since data L2 is shared only among two cores, this might count as sharing L2 among fewer cores than L3 even though instruction L2 is shared with the same number of cores as L3.
Since M7 cores are 8-way threaded (as well as having only two-wide, out-of-order execution), L2 latency is less important (both thread-level parallelism and instruction level parallelism extracted from out-of-order execution can hide latency and a narrower core reduces the execution potential loss from a given number of stall cycles). Since the processor targets commercial workloads with high thread-level parallelism and low instruction-level parallelism, increasing the core and thread count were primary goals; sharing L2 caches can exploit common instruction and data use — the former is especially significant, but data sharing is not rare — facilitating lower total capacity, leaving room for more cores.
SPARC M8 was similar, but the L2 data cache was made private and the issue width doubled to four-wide. The increase in issue width increases the importance of L2 latency, especially with modest sized (16 KiB) L1 caches. Instruction cache is somewhat more latency tolerant given an ability to fetch ahead in an instruction stream.
Some considerations of the tradeoffs of intermediate level cache sharing
Increasing the size of an L2 cache via sharing would reduce the capacity miss rate when capacity demand is imbalanced (not only when one core is inactive but even when different phases of the same program are active on different cores), but sharing L2 among multiple cores increases conflict misses. Increasing associativity can eliminate this effect at the cost of higher energy per access.
When two cores access the same memory locations within a shortish period of time, a shared cache can increase effective capacity by reducing replication as well as potentially improving replacement decisions and providing limited prefetch. Sharing can also reduce cache block ping-pong if the writer and reader share L2 cache; however, explicitly taking advantage of such increases the complexity of software core allocation. If sharing of a frequently written value is unavoidably common, even a random reduction in ping-ponging may be attractive, but the benefit diminishes rapidly as the number of cores involved increases.
When L2 is an intermediate level cache, access latency has significant importance since capacity misses from a smaller L2 will generally hit in L3. Doubling the capacity will increase access latency by roughly 40% (latency is roughly proportional to the square root of capacity). Arbitration among multiple requesters also tends to increase latency. (A non-uniform cache architecture, where different cache blocks have different latencies, can compensate for such. E.g., in the context of sharing among two cores, a quarter of the capacity could be located closest to each core and the remaining half at an intermediate distance from both cores. However, NUCA introduces complexity in allocation.)
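The square-root rule of thumb above can be made concrete with a few lines of arithmetic. This is only a sketch of that approximation; the function name `relative_latency` is made up for illustration, and real latencies depend on the physical design.

```python
import math

def relative_latency(capacity_ratio: float) -> float:
    """Latency multiplier under the rule of thumb latency ~ sqrt(capacity)."""
    return math.sqrt(capacity_ratio)

# How much slower a cache gets when its capacity grows by the given factor,
# assuming nothing else changes (an idealization).
for ratio in (2, 4, 8):
    increase_percent = (relative_latency(ratio) - 1) * 100
    print(f"{ratio}x capacity -> about {increase_percent:.0f}% higher latency")
```

Under this approximation, two cores sharing an L2 of twice the private size would each see every L2 hit take about 1.4 times as long, before counting arbitration.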
While increasing L2 capacity would use area that could otherwise be used by L3 cache (or more cores or other features), the size of L3 slices is typically so much larger than L2 capacity that this effect is not a primary consideration.
Sharing L2 among two cores also means that the provided bandwidth must be suitable for two highly active cores. While banking can be used to facilitate such (and extra bandwidth might be exploitable by a single active core), such increased bandwidth is not entirely free.
Sharing L2 would also motivate increasing the complexity of cache allocation and replacement. One would prefer to avoid one core wasting capacity (or even associativity). Such moderating mechanisms are sometimes provided for last level cache (e.g., Intel's Cache Allocation Technology), so this is not a hard barrier. Some of the moderating mechanisms could also facilitate better replacement in general, and L2 mechanisms could exploit metadata associated with L3 cache (reducing the tagging overhead for metadata tracking) to adjust behavior.
Sharing L2 cache also introduces complexity with respect to frequency adjustment. If one core supports a lower frequency, the interface between the core and L2 becomes more complex, increasing access latency. (In theory, a NUCA design like that mentioned above could have a small close portion running at the local frequency and only pay the clock boundary crossing penalty when accessing the more distant portion.)
Power gating is also simplified when L2 cache is dedicated to a single core. Rather than having three power domains (two cores and L2), a private L2 can be turned off with its core so only two power domains are needed. (Note that adding power domains is not extremely expensive and has been proposed for reducing power by dynamically reducing cache capacity.)
A shared L2 cache can also provide a convenient merging point for the on-chip network, reducing the number of nodes in the broader network. (This merging could alternatively be done behind the L2 cache, providing lower latency and potentially higher bandwidth communication between two cores while also providing isolation.)
Conclusion
Fundamentally, sharing increases utilization — which is good for throughput (roughly speaking, efficiency) but bad for latency (local performance) — but hinders optimization by specialization. For L2 caches with a backing L3 cache, the specialization benefit (lower latency) tends to outweigh the utilization benefit for general designs (which generally trade throughput and efficiency for lower latency). The on-chip L3 cache reduces the cost of L2 capacity misses, so a higher L2 miss rate with a faster L2 hit time can reduce average memory access time.
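The average memory access time argument can be checked with a small calculation. The sketch below uses the usual recursive AMAT formula with local miss rates; every number in it is hypothetical and chosen only to illustrate the trade-off, not measured on any processor.

```python
def amat(l1_hit, l1_miss, l2_hit, l2_miss, l3_hit, l3_miss, mem):
    """Average memory access time (cycles) for a three-level hierarchy.

    Hit times are in cycles; miss rates are local (misses per access
    that reaches that level).
    """
    l3_time = l3_hit + l3_miss * mem
    l2_time = l2_hit + l2_miss * l3_time
    return l1_hit + l1_miss * l2_time

# Hypothetical parameters shared by both configurations.
common = dict(l1_hit=4, l1_miss=0.05, l3_hit=40, l3_miss=0.3, mem=200)

# Small private L2: faster hits, more misses.
private_l2 = amat(l2_hit=12, l2_miss=0.40, **common)
# Larger shared L2: slower hits, fewer misses.
shared_l2 = amat(l2_hit=17, l2_miss=0.36, **common)

print(f"private L2: {private_l2:.2f} cycles, shared L2: {shared_l2:.2f} cycles")
```

With these assumed numbers the faster private L2 wins despite its higher miss rate. Raising `l3_hit` to 100 in the same sketch (a stand-in for a slow, off-chip L3) makes the larger shared L2 come out ahead, which matches the reasoning given for the older designs.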
At the cost of design complexity and some overheads, sharing can be made more flexible or the costs of sharing can be reduced. Increasing complexity increases development risk and marketing risk (not just time to market: feature complexity increases the difficulty of the buyer's choice, yet marketing simplifications can seem deceptive). For L2 caches, the costs of more nuanced sharing seem to have generally not been considered worth the potential benefits.

Related

Optimal buffer size to avoid cache misses for recent i7 / i9 CPUs

Let's assume an algorithm is repeatedly processing buffers of data. It may be accessing, say, 2 to 16 of these buffers, all having the same size. What would you expect to be the optimum size of these buffers, assuming the algorithm can process the full data in smaller blocks?
I expect cache misses to be the potential bottleneck if the blocks are too big, but of course the bigger the blocks, the better for vectorization.
Let's assume current i7/i9 CPUs (2018).
Any ideas?
Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
Modern Intel CPUs have 32k L1d, 256k L2 private per-core. (Or Skylake-AVX512 has 1MiB private L2 caches, with less shared L3). (Which cache mapping technique is used in intel core i7 processor?)
Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.
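As a starting point for tuning, the "about half of L2" guideline can be turned into a per-buffer block size. This is a rough sketch, not a rule: the helper name `block_bytes_per_buffer`, the 0.5 default fraction, and the rounding down to whole 64-byte lines are all assumptions made for illustration.

```python
def block_bytes_per_buffer(l2_bytes: int, num_buffers: int,
                           l2_fraction: float = 0.5,
                           line_bytes: int = 64) -> int:
    """Suggest a block size so that all buffers together use about
    l2_fraction of the L2 cache, rounded down to whole cache lines."""
    budget = int(l2_bytes * l2_fraction)
    per_buffer = budget // num_buffers
    # Never go below one cache line.
    return max(line_bytes, (per_buffer // line_bytes) * line_bytes)

# 256 KiB private L2 with 2, 4 and 16 equally sized buffers.
for n in (2, 4, 16):
    print(n, "buffers ->", block_bytes_per_buffer(256 * 1024, n), "bytes each")
```

The result is only a first guess to feed into profiling; the best size for a given algorithm may be smaller or larger.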
Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and spending more time outside of the main loop.
Also remember that with Hyperthreading, each logical core competes for cache on the physical core it's running on. So if two threads end up on the same physical core, but are touching totally different memory, your effective cache sizes are about half.
Probably you should make the buffer size a tunable parameter, and profile with a few different sizes.
Use perf counters to check if you're actually avoiding L1d or L2 misses or not, with different sizes, to help you understand whether your code is sensitive to different amounts of memory latency or not.

Off Chip Cache Coherence and L2 cache partitioning in multicores (a programmer's view)

Well, I recently studied that in order to save chip area, multicore processors don't have the cache coherence hardware at the L1 level. Rather, the L2 cache is partitioned (no. of partitions = no. of hyperthreads or whatever) to enforce off-chip cache coherence. At least this is what I interpreted from the lecture. Is this correct?
If yes, then I am unable to visualize how this is even possible. How can you ignore the coherence at L1 level? If my interpretation is incorrect then please shed some light on off-chip cache coherence and why the L2 is partitioned.
Thanks!
The lecture was probably indicating that the L1 cache in a multicore processor is not generally snooped to maintain coherence. Instead a higher level of the cache hierarchy filters coherence traffic. With a fully inclusive (in tags only or tags and data) level of cache, extra bits can provide a local coherence directory--e.g., a bit vector of all cores or larger nodes indicating if the node has the cache block. (This directory may be used as a filter rather than an exact tracking, e.g., to avoid buffering on lower-level cache evictions.) Other forms of filtering are also possible. The primary requirement is that all cases where the data is present in a lower level cache are detected, a modest fraction of false positives would only modestly increase the amount of snoop traffic going to the lower level caches.
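The bit-vector directory idea can be sketched as a small software model. This is a toy illustration under simplifying assumptions (one bit per core, exact tracking with eviction notifications); the class `SnoopFilter` and its method names are hypothetical and do not correspond to any particular hardware.

```python
class SnoopFilter:
    """Toy presence directory: one bit per core for each tracked block."""

    def __init__(self, num_cores: int):
        self.num_cores = num_cores
        self.presence = {}  # block address -> bit vector of cores holding it

    def record_fill(self, block: int, core: int) -> None:
        """A lower-level cache of `core` now holds `block`."""
        self.presence[block] = self.presence.get(block, 0) | (1 << core)

    def record_eviction(self, block: int, core: int) -> None:
        """Optional: a filter that skips this step stays correct but
        produces more false positives (extra probes)."""
        bits = self.presence.get(block, 0) & ~(1 << core)
        if bits:
            self.presence[block] = bits
        else:
            self.presence.pop(block, None)

    def cores_to_probe(self, block: int, requester: int) -> list:
        """Cores (other than the requester) that may hold the block."""
        bits = self.presence.get(block, 0) & ~(1 << requester)
        return [c for c in range(self.num_cores) if (bits >> c) & 1]

sf = SnoopFilter(num_cores=4)
sf.record_fill(0x40, core=1)
sf.record_fill(0x40, core=3)
print(sf.cores_to_probe(0x40, requester=0))  # only cores 1 and 3 need a probe
print(sf.cores_to_probe(0x80, requester=0))  # untracked block: no probes
```

Without the filter, a miss from core 0 would have to probe all three other cores for both blocks.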
Without such a filter, every miss on another core/node would have to probe all the other L1 caches. In addition to using more interconnect bandwidth, this extra tag probing requirement would typically be handled by replicating the L1 tags because L1 caches are highly optimized for latency and access bandwidth (making it more desirable to avoid unnecessary interference from coherence probes).
In a common multicore processor with on-chip L3, L2 caches are "private" to a node of one or a small number of cores. (Private in this context means that allocations are driven by the cores within the node. This L2 capacity is not used by other nodes.) Such a private L2 filters accesses from reaching the shared L3 on a hit (as long as it does not require an update to exclusive/modified status). By sharing L2 cache among only a small number (often one) of cores, access latency is kept lower both by more direct connection to the cores and by requiring a lower capacity. (Sharing L2 among two or even four cores can reduce the number of nodes in the higher level network and balance utilization of L2 capacity.)
The last (on-chip) level of cache (LLC) is often partitioned. Attaching a slice to each lower level node allows that slice to have lower latency for communication with that node. Cache blocks that are accessed by that node can be preferentially placed in that slice or in a nearby (by network topology) slice to allow lower latency (and potentially higher bandwidth) local access. (This is a Non-Uniform Cache Architecture optimization. Because blocks are not tied to a specific slice based on address or accessing node it is possible to migrate and even replicate blocks.)
Alternately, allocation to the LLC slices can be strictly based on address, possibly associating each LLC slice with a memory controller. This requires only one slice to be probed to determine a hit or miss and fits with the use of a crossbar interconnect between lower level nodes and the LLC slices. This arrangement has the disadvantages that the memory controller-LLC connection is less latency critical and that utilization is tied to balanced demand based on address. However, it can provide faster determination of an L3 hit/miss and may (if slices are associated with memory controllers) reduce overhead for prefetching from memory and eager writeback. (When misses are more common and/or blocks are frequently shared by multiple nodes, address-based allocation becomes more attractive because a miss only needs to probe one slice [in addition to possibly supporting more aggressive prefetching and more likely being memory bandwidth limited rather than LLC capacity limited--so imbalance in memory controller use would be bad anyway] and a shared block can be more directly accessed by all of the nodes that use it without replication.)
(Obviously combinations of these two allocation methods can be used. Even just biasing allocation based on address could reduce demand on interconnect bandwidth.)
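Strictly address-based slice allocation can be sketched as a function from physical address to slice number. The modulo mapping below is a deliberately simple assumption made for illustration; real processors may use a more elaborate hash of the address bits, and `home_slice` is a hypothetical name.

```python
LINE_BYTES = 64  # assumed cache line size

def home_slice(phys_addr: int, num_slices: int) -> int:
    """Pick the single LLC slice that may hold this address.

    Consecutive cache lines go to consecutive slices, so only one slice
    has to be probed to decide hit or miss.
    """
    return (phys_addr // LINE_BYTES) % num_slices

# Four consecutive lines spread over four slices, then wrap around.
for addr in range(0, 5 * LINE_BYTES, LINE_BYTES):
    print(hex(addr), "-> slice", home_slice(addr, 4))
```

Because the slice is a pure function of the address, utilization depends on how evenly the accessed addresses spread over the slices, which is the balance concern raised above.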
Partitioning tends to reduce latency (especially with a NUCA arrangement) and design complexity as well as facilitate design reuse with different numbers of partitions (and perhaps defect isolation so that a chip with a manufacturing defect can more easily be used as a product with fewer partitions).

why are there multiple layers of caches

Does anyone know why in most of today's processors there are several layers of caches, like L1, L2 and L3? Why can't a processor do with one big L1 cache?
Doesn't having multiple layers of cache increase the complexity of caching protocols?
Die size. L1 is usually on-die; there is not room for a large cache on-die. L2/3 gets its own die and can be bigger and processed differently.
Also speed; L1 is built with tradeoffs for maximum speed, while L2/3 doesn't have to be as aggressively sped up.
Also multi-core. Modern multi-core processors give each core its own L1 for speed, but they share some or all of the other caches for coherency.
That said, PA-RISC processors have been built with the "let's just make a big L1 cache" approach. They were expensive.
Why can't a processor do with one big L1 cache?
The larger your processor cache, the longer the latency. There are also practical and cost considerations, since larger caches occupy more physical space on a chip. After a certain size, you lose too much of the caching speedup to make it worth it to increase cache size further. Eventually, therefore, a large cache becomes undesirable.
Processor designs that still want a large cache can make a tradeoff by having multiple cache levels. You start with a small and fast cache, and gradually fall back to larger, slower caches on successive misses.
B/c in today's architectures you have more than one CPU/core accessing the memory. The L3 cache is a cache of caches that is shared between all the CPUs. This reduces the amount of data that needs to go through the memory bus, which is usually a good idea. If you want, you can have a look at https://imgur.com/gallery/aBKD0Fv, which shows how the layers are organized and how they evolved through time.

What is locality of reference?

I am having a problem understanding locality of reference. Can anyone please help me understand what it means and what the following are:
Spatial Locality of reference
Temporal Locality of reference
This would not matter if your computer was filled with super-fast memory.
But unfortunately that's not the case and computer-memory looks something like this [1]:
+----------+
| CPU      |  <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache|  <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 - Cache|  <<-- ~12 CPU-cycles access latency (fast)
+----+-----+
     |
+----------+
|L3 - Cache|  <<-- ~35 CPU-cycles access latency (medium)
+----+-----+       (usually shared between CPU-cores)
     |
     |   <<-- This thin wire is the memory bus, it has limited bandwidth.
+----+-----+
| main-mem |  <<-- ~100 CPU-cycles access latency (slow)
+----+-----+  <<-- The main memory is big but slow (because we are cheap-skates)
     |
     |   <<-- Even slower wire to the harddisk
+----+-----+
| harddisk |  <<-- Works at 0.001% of CPU speed
+----------+
Spatial Locality
In this diagram, the closer data is to the CPU the faster the CPU can get at it.
This is related to spatial locality. Data has spatial locality if it is located close together in memory.
Because of the cheap-skates that we are, RAM is not really Random Access Memory; it is really Slow-If-Random, Less-Slow-If-Accessed-Sequentially Access Memory (SIRLSIAS-AM). DDR SDRAM transfers a whole burst of 32 or 64 bytes for one read or write command.
That is why it is smart to keep related data close together, so you can do a sequential read of a bunch of data and save time.
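The benefit of keeping related data close together can be illustrated with a toy cache model that simply counts misses. This is a sketch under stated assumptions: a tiny fully associative LRU cache of 32 lines of 64 bytes, 8-byte elements, and a 64x64 row-major array. The function `count_misses` and all sizes are invented for the example.

```python
from collections import OrderedDict

def count_misses(addresses, line_bytes=64, num_lines=32):
    """Count misses in a toy fully associative LRU cache."""
    cache = OrderedDict()  # cache line number -> True, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used
    return misses

ROWS, COLS, ELEM = 64, 64, 8  # row-major array of 8-byte elements

# Same elements, two visiting orders.
row_major = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
col_major = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

print("sequential (row by row):   ", count_misses(row_major), "misses")
print("strided (column by column):", count_misses(col_major), "misses")
```

In this model the sequential order misses once per 64-byte line and then uses the other seven elements of that line, while the strided order touches more distinct lines per pass than the cache holds and so misses on every access.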
Temporal locality
Data stays in main-memory, but it cannot stay in the cache, or the cache would stop being useful. Only the most recently used data can be found in the cache; old data gets pushed out.
This is related to temporal locality. Data has strong temporal locality if it is accessed at around the same time.
This is important because if item A is in the cache (good) then item B (with strong temporal locality to A) is very likely to also be in the cache.
Footnote 1:
This is a simplification with latency cycle counts estimated from various CPUs for example purposes, but it gives you the right order-of-magnitude idea for typical CPUs.
In reality latency and bandwidth are separate factors, with latency harder to improve for memory farther from the CPU. But HW prefetching and/or out-of-order exec can hide latency in some cases, like looping over an array. With unpredictable access patterns, effective memory throughput can be much lower than 10% of L1d cache.
For example, L2 cache bandwidth is not necessarily 3x worse than L1d bandwidth. (But it is lower if you're using AVX SIMD to do 2x 32-byte loads per clock cycle from L1d on a Haswell or Zen2 CPU.)
This simplified version also leaves out TLB effects (page-granularity locality) and DRAM-page locality. (Not the same thing as virtual memory pages). For a much deeper dive into memory hardware and tuning software for it, see What Every Programmer Should Know About Memory?
Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why a multi-level cache hierarchy is necessary to get the combination of latency/bandwidth and capacity (and hit-rate) we want.
One huge fast L1-data cache would be prohibitively power-expensive, and still not even possible with as low latency as the small fast L1d cache in modern high-performance CPUs.
In multi-core CPUs, L1i/L1d and L2 cache are typically per-core private caches, with a shared L3 cache. Different cores have to compete with each other for L3 and memory bandwidth, but each have their own L1 and L2 bandwidth. See How can cache be that fast? for a benchmark result from a dual-core 3GHz IvyBridge CPU: aggregate L1d cache read bandwidth on both cores of 186 GB/s vs. 9.6 GB/s DRAM read bandwidth with both cores active. (So memory = 10% L1d for single-core is a good bandwidth estimate for desktop CPUs of that generation, with only 128-bit SIMD load/store data paths). And L1d latency of 1.4 ns vs. DRAM latency of 72 ns
It is a principle which states that if some variables are referenced by a program, it is highly likely that the same might be referenced again (later in time - also known as temporal locality).
It is also highly likely that any consecutive storage in memory might be referenced sooner (spatial locality).
First of all, note that these concepts are not universal laws, they are observations about common forms of code behavior that allow CPU designers to optimize their system to perform better over most of the programs. At the same time, these are properties that programmers seek to adopt in their programs as they know that's how memory systems are built and that's what CPU designers optimize for.
Spatial locality refers to the property of some (most, actually) applications to access memory in a sequential or strided manner. This usually stems from the fact that the most basic data structure building blocks are arrays and structs, both of which store multiple elements adjacently in memory. In fact, many implementations of data structures that are semantically linked (graphs, trees, skip lists) are using arrays internally to improve performance.
Spatial locality allows a CPU to improve the memory access performance thanks to:
Memory caching mechanisms: structures such as caches, page tables, and memory controller pages are already larger by design than what is needed for a single access. This means that once you pay the memory penalty for bringing data from far memory or a lower level cache, the more additional data you can consume from it, the better your utilization.
Hardware prefetching, which exists on almost all CPUs today, often covers spatial accesses. Every time you fetch addr X, the prefetcher will likely fetch the next cache line, and possibly others further ahead. If the program exhibits a constant stride, most CPUs would be able to detect that as well and extrapolate to prefetch even further steps of the same stride. Modern spatial prefetchers may even predict variable recurring strides (e.g. VLDP, SPP).
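Constant-stride detection can be sketched in a few lines. This is a minimal model of the idea only: it watches a single access stream and prefetches once the same non-zero stride has been seen twice in a row. The function `stride_prefetch` and its `degree` parameter are hypothetical; hardware prefetchers typically track many streams and use confidence counters.

```python
def stride_prefetch(addresses, degree=2):
    """Return (address, prefetched addresses) pairs for one access stream."""
    last_addr = None
    last_stride = None
    result = []
    for addr in addresses:
        prefetches = []
        if last_addr is not None:
            stride = addr - last_addr
            # Two identical non-zero strides in a row: extrapolate ahead.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * k for k in range(1, degree + 1)]
            last_stride = stride
        last_addr = addr
        result.append((addr, prefetches))
    return result

# A regular stream with a 64-byte stride triggers prefetching from the
# third access on; an irregular stream never does.
regular = stride_prefetch([0, 64, 128, 192])
irregular = stride_prefetch([0, 64, 200, 208])
print(regular)
print(irregular)
```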
Temporal locality refers to the property of memory accesses or access patterns to repeat themselves. In the most basic form this could mean that if address X was once accessed it may also be accessed in the future, but since caches already store recent data for a certain duration this form is less interesting (although there are mechanisms on some CPUs aimed to predict which lines are likely to be accessed again soon and which are not).
A more interesting form of temporal locality is that two (or more) temporally adjacent accesses observed once, may repeat together again. That is - if you once accessed address A and soon after that address B, and at some later point the CPU detects another access to address A - it may predict that you will likely access B again soon, and proceed to prefetch it in advance.
Prefetchers aimed at extracting and predicting this type of relation (temporal prefetchers) often use relatively large storage to record many such relations. (See Markov prefetching, and more recently ISB, STMS, Domino, etc.)
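The "A was followed by B last time, so predict B when A is seen again" idea can be sketched with a simple successor table. This is a heavily simplified illustration in the spirit of Markov prefetching, assuming one remembered successor per address and unlimited table size; `markov_predictions` is a made-up name, and it is not an implementation of ISB, STMS or Domino.

```python
def markov_predictions(addresses):
    """For each access, return the address predicted to follow it
    (or None if this address has not been seen with a successor yet)."""
    successor = {}  # address -> address that followed it most recently
    prev = None
    predictions = []
    for addr in addresses:
        predictions.append(successor.get(addr))  # candidate to prefetch
        if prev is not None:
            successor[prev] = addr               # learn the pair (prev, addr)
        prev = addr
    return predictions

A, B, X, Y = 0x1000, 0x9040, 0x20, 0x5000  # arbitrary, unrelated addresses
trace = [A, B, X, Y, A, B]
print(markov_predictions(trace))
```

On the second access to A the table predicts B even though the two addresses are far apart, which a stride-based prefetcher could not do. The storage cost is one table entry per recorded relation, which is why such prefetchers need relatively large tables.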
By the way, these concepts are in no way exclusive, and a program can exhibit both types of localities (as well as other, more irregular forms). Sometimes both are even grouped together under the term spatio-temporal locality to represent the "common" forms of locality, or a combined form where the temporal correlation connects spatial constructs (like address delta always following another address delta).
Temporal locality of reference - A memory location that has been used recently is more likely to be accessed again. E.g., variables in a loop: the same set of variables (symbolic names for memory locations) is used for some number i of iterations of the loop.
Spatial locality of reference - A memory location that is close to the currently accessed memory location is more likely to be accessed. E.g., if you declare int a,b; float c,d; the compiler is likely to assign them consecutive memory locations. So if a is being used then it is very likely that b, c or d will be used in the near future. This is one way in which cache lines of 32 or 64 bytes help: they are much larger than the 4 or 8 bytes of a typical int, float, long or double variable.

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
L1 is very tightly coupled to the CPU core, and is accessed on every memory access (very frequent). Thus, it needs to return the data really fast (usually within a few clock cycles). Latency and throughput (bandwidth) are both performance-critical for L1 data cache (e.g. four cycle latency, and supporting two reads and one write by the CPU core every clock cycle). It needs lots of read/write ports to support this high access bandwidth. Building a large cache with these properties is impossible. Thus, designers keep it small, e.g. 32KB in most processors today.
L2 is accessed only on L1 misses, so accesses are less frequent (usually 1/20th of the L1). Thus, L2 can have higher latency (e.g. from 10 to 20 cycles) and have fewer ports. This allows designers to make it bigger.
L1 and L2 play very different roles. If L1 is made bigger, it will increase L1 access latency which will drastically reduce performance because it will make all dependent loads slower and harder for out-of-order execution to hide. L1 size is barely debatable.
If we removed L2, L1 misses would have to go to the next level, say memory. This means that a lot of accesses would be going to memory, which would imply we need more memory bandwidth, which is already a bottleneck. Thus, keeping the L2 around is favorable.
Experts often refer to L1 as a latency filter (as it makes the common case of L1 hits faster) and L2 as a bandwidth filter as it reduces memory bandwidth usage.
Note: I have assumed a 2-level cache hierarchy in my argument to make it simpler. In many of today's multicore chips, there's an L3 cache shared between all the cores, while each core has its own private L1 and maybe L2. In these chips, the shared last-level cache (L3) plays the role of memory bandwidth filter. L2 plays the role of on-chip bandwidth filter, i.e. it reduces access to the on-chip interconnect and the L3. This allows designers to use a lower-bandwidth interconnect like a ring, and a slow single-port L3, which allows them to make L3 bigger.
Perhaps worth mentioning that the number of ports is a very important design point because it affects how much chip area the cache consumes. Ports add wires to the cache which consumes a lot of chip area and power.
There are different reasons for that.
L2 exists in the system to speed up the case where there is an L1 cache miss. If the size of L1 was the same or bigger than the size of L2, then L2 could not accommodate more cache lines than L1, and would not be able to deal with L1 cache misses. From the design/cost perspective, L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to the slower hardware by adding intermediate hardware that is faster (and more expensive) than the slowest hardware and yet cheaper than the fastest hardware you have. Even if you decided to double the L1 cache, you would also increase L2, to speed up L1 cache misses.
So why is there L2 cache at all? Well, L1 cache is usually more performant and expensive to build, and it is bound to a single core. This means that increasing the L1 size by a fixed quantity will have that cost multiplied by 4 in a dual core processor, or by 8 in a quad core. L2 is usually shared by different cores --depending on the architecture it can be shared across a couple or all cores in the processor, so the cost of increasing L2 would be smaller even if the price of L1 and L2 were the same --which it is not.
@Aater's answer explains some of the basics. I'll add some more details + an example of the real cache organization on Intel Haswell and AMD Piledriver, with latencies and other properties, not just size.
For some details on IvyBridge, see my answer on "How can cache be that fast?", with some discussion of the overall load-use latency including address-calculation time, and widths of the data busses between different levels of cache.
L1 needs to be very fast (latency and throughput), even if that means a limited hit-rate. L1d also needs to support single-byte stores on almost all architectures, and (in some designs) unaligned accesses. This makes it hard to use ECC (error correction codes) to protect the data, and in fact some L1d designs (Intel) just use parity, with better ECC only in outer levels of cache (L2/L3) where the ECC can be done on larger chunks for lower overhead.
It's impossible to design a single level of cache that could provide the low average request latency (averaged over all hits and misses) of a modern multi-level cache. Since modern systems have multiple very hungry cores all sharing a connection to the same relatively-high latency DRAM, this is essential.
Every core needs its own private L1 for speed, but at least the last level of cache is typically shared, so a multi-threaded program that reads the same data from multiple threads doesn't have to go to DRAM for it on each core. (And to act as a backstop for data written by one core and read by another). This requires at least two levels of cache for a sane multi-core system, and is part of the motivation for more than 2 levels in current designs. Modern multi-core x86 CPUs have a fast 2-level cache in each core, and a larger slower cache shared by all cores.
L1 hit-rate is still very important, so L1 caches are not as small / simple / fast as they could be, because that would reduce hit rates. Achieving the same overall performance would thus require higher levels of cache to be faster. If higher levels handle more traffic, their latency is a bigger component of the average latency, and they bottleneck on their throughput more often (or need higher throughput).
High throughput often means being able to handle multiple reads and writes every cycle, i.e. multiple ports. This takes more area and power for the same capacity as a lower-throughput cache, so that's another reason for L1 to stay small.
L1 also uses speed tricks that wouldn't work if it was larger. i.e. most designs use Virtually-Indexed, Physically Tagged (VIPT) L1, but with all the index bits coming from below the page offset so they behave like PIPT (because the low bits of a virtual address are the same as in the physical address). This avoids synonyms / homonyms (false hits or the same data being in the cache twice, and see Paul Clayton's detailed answer on the linked question), but still lets part of the hit/miss check happen in parallel with the TLB lookup. A VIVT cache doesn't have to wait for the TLB, but it has to be invalidated on every change to the page tables.
On x86 (which uses 4kiB virtual memory pages), 32kiB 8-way associative L1 caches are common in modern designs. The 8 tags can be fetched based on the low 12 bits of the virtual address, because those bits are the same in virtual and physical addresses (they're below the page offset for 4kiB pages). This speed-hack for L1 caches only works if they're small enough and associative enough that the index doesn't depend on the TLB result. 32kiB / 64B lines / 8-way associativity = 64 (2^6) sets. So the lowest 6 bits of an address select bytes within a line, and the next 6 bits index a set of 8 tags. This set of 8 tags is fetched in parallel with the TLB lookup, so the tags can be checked in parallel against the physical-page selection bits of the TLB result to determine which (if any) of the 8 ways of the cache hold the data. (Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical)
Making a larger L1 cache would mean it had to either wait for the TLB result before it could even start fetching tags and loading them into the parallel comparators, or it would have to increase in associativity to keep log2(sets) + log2(line_size) <= 12. (More associativity means more ways per set => fewer total sets = fewer index bits). So e.g. a 64kiB cache would need to be 16-way associative: still 64 sets, but each set has twice as many ways. This makes increasing L1 size beyond the current size prohibitively expensive in terms of power, and probably even latency.
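The constraint boils down to "bytes per way must not exceed the page size". A minimal sketch (the helper name `min_ways_for_vipt` is made up for this example, and it assumes no OS tricks like page coloring):

```python
def min_ways_for_vipt(size_bytes, page_bytes=4096):
    """Smallest associativity that keeps every index bit inside the page offset.

    Equivalent to requiring size_bytes / ways <= page_bytes.
    """
    return max(1, -(-size_bytes // page_bytes))  # ceiling division


for kib in (16, 32, 64, 128):
    ways = min_ways_for_vipt(kib * 1024)
    print(f"{kib:>3} kiB L1 needs at least {ways}-way associativity")
```

Doubling capacity doubles the required number of ways, which is where the power cost comes from.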
Spending more of your power budget on L1D cache logic would leave less power available for out-of-order execution, decoding, and of course L2 cache and so on. Getting the whole core to run at 4GHz and sustain ~4 instructions per clock (on high-ILP code) without melting requires a balanced design. See this article: Modern Microprocessors: A 90-Minute Guide!.
The larger a cache is, the more you lose by flushing it, so a large VIVT L1 cache would be worse than the current VIPT-that-works-like-PIPT. And a larger but higher-latency L1D would probably also be worse.
According to @PaulClayton, L1 caches often fetch all the data in a set in parallel with the tags, so it's there ready to be selected once the right tag is detected. The power cost of doing this scales with associativity, so a large highly-associative L1 would be really bad for power-use as well as die-area (and latency). (Compared to L2 and L3, it wouldn't be a lot of area, but physical proximity is important for latency. Speed-of-light propagation delays matter when clock cycles are 1/4 of a nanosecond.)
Slower caches (like L3) can run at a lower voltage / clock speed to make less heat. They can even use different arrangements of transistors for each storage cell, to make memory that's more optimized for power than for high speed.
There are a lot of power-use related reasons for multi-level caches. Power / heat is one of the most important constraints in modern CPU design, because cooling a tiny chip is hard. Everything is a tradeoff between speed and power (and/or die area). Also, many CPUs are powered by batteries or are in data-centres that need extra cooling.
L1 is almost always split into separate instruction and data caches. Instead of an extra read port in a unified L1 to support code-fetch, we can have a separate L1I cache tied to a separate I-TLB. (Modern CPUs often have an L2-TLB, which is a second level of cache for translations that's shared by the L1 I-TLB and D-TLB, NOT a TLB used by the regular L2 cache). This gives us 64kiB total of L1 cache, statically partitioned into code and data caches, for much cheaper (and probably lower latency) than a monster 64k L1 unified cache with the same total throughput. Since there is usually very little overlap between code and data, this is a big win.
L1I can be placed physically close to the code-fetch logic, while L1D can be physically close to the load/store units. Speed-of-light transmission-line delays are a big deal when a clock cycle lasts only 1/3rd of a nanosecond. Routing the wiring is also a big deal: e.g. Intel Broadwell has 13 layers of copper above the silicon.
Split L1 helps a lot with speed, but unified L2 is the best choice.
Some workloads have very small code but touch lots of data. It makes sense for higher-level caches to be unified to adapt to different workloads, instead of statically partitioning into code vs. data. (e.g. almost all of L2 will be caching data, not code, while running a big matrix multiply, vs. having a lot of code hot while running a bloated C++ program, or even an efficient implementation of a complicated algorithm (e.g. running gcc)). Code can be copied around as data, not always just loaded from disk into memory with DMA.
Caches also need logic to track outstanding misses (since out-of-order execution means that new requests can keep being generated before the first miss is resolved). Having many misses outstanding means you overlap the latency of the misses, achieving higher throughput. Duplicating the logic and/or statically partitioning between code and data in L2 would not be good.
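The benefit of overlapping misses follows from Little's Law: sustained bandwidth is at most (misses in flight) x (line size) / (miss latency). A rough sketch, where the 80 ns miss latency is an assumed round number rather than a measured one:

```python
def max_miss_bandwidth(outstanding_misses, line_bytes, miss_latency_ns):
    """Upper bound on miss bandwidth in bytes per ns (numerically equal to GB/s)."""
    return outstanding_misses * line_bytes / miss_latency_ns


LINE_BYTES = 64
ASSUMED_LATENCY_NS = 80  # hypothetical miss latency, for illustration only

for in_flight in (1, 10, 16):
    gb_per_s = max_miss_bandwidth(in_flight, LINE_BYTES, ASSUMED_LATENCY_NS)
    print(f"{in_flight:>2} misses in flight: at most {gb_per_s:.1f} GB/s")
```

With only one miss in flight, bandwidth is capped at one line per full miss latency; each extra tracked miss raises that ceiling proportionally.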
Larger lower-traffic caches are also a good place to put pre-fetching logic. Hardware pre-fetching enables good performance for things like looping over an array without every piece of code needing software-prefetch instructions. (SW prefetch was important for a while, but HW prefetchers are smarter than they used to be, so that advice in Ulrich Drepper's otherwise excellent What Every Programmer Should Know About Memory is out-of-date for many use cases.)
Low-traffic higher level caches can afford the latency to do clever things like use an adaptive replacement policy instead of the usual LRU. Intel IvyBridge and later CPUs do this, to resist access patterns that get no cache hits for a working set just slightly too large to fit in cache. (e.g. looping over some data in the same direction twice means it probably gets evicted just before it would be reused.)
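The LRU failure mode is easy to reproduce with a toy model of a single 8-way set being looped over with 9 lines. The "evict most-recently-used" alternative below is only a stand-in to show that a different policy can do better on this pattern; it is not a description of Intel's actual adaptive policy.

```python
from collections import OrderedDict


def count_hits(capacity, accesses, evict_most_recent=False):
    """Count hits in a fully associative cache of `capacity` lines."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                # last=False pops the LRU entry, last=True pops the MRU entry
                cache.popitem(last=evict_most_recent)
            cache[line] = True
    return hits


# Loop 10 times over a working set just one line too big for the cache.
accesses = list(range(9)) * 10
lru_hits = count_hits(8, accesses)
mru_hits = count_hits(8, accesses, evict_most_recent=True)
print(f"LRU hits: {lru_hits} of {len(accesses)}")
print(f"MRU-eviction hits: {mru_hits} of {len(accesses)}")
```

Under LRU every line is evicted just before it comes around again, so the loop never hits, while the alternative policy keeps most of the working set resident.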
A real example: Intel Haswell. Sources: David Kanter's microarchitecture analysis and Agner Fog's testing results (microarch pdf). See also Intel's optimization manuals (links in the x86 tag wiki).
Also, I wrote up a separate answer on: Which cache mapping technique is used in intel core i7 processor?
Modern Intel designs use a large inclusive L3 cache shared by all cores as a backstop for cache-coherence traffic. It's physically distributed between the cores, with 2048 sets * 16-way (2MiB) per core (with an adaptive replacement policy in IvyBridge and later).
The lower levels of cache are per-core.
L1: per-core 32kiB each instruction and data (split), 8-way associative. Latency = 4 cycles. At least 2 read ports + 1 write port. (Maybe even more ports to handle traffic between L1 and L2, or maybe receiving a cache line from L2 conflicts with retiring a store.) Can track 10 outstanding cache misses (10 fill buffers).
L2: unified per-core 256kiB, 8-way associative. Latency = 11 or 12 cycles. Read bandwidth: 64 bytes / cycle. The main prefetching logic prefetches into L2. Can track 16 outstanding misses. Can supply 64B per cycle to the L1I or L1D. Actual port counts unknown.
L3: unified, shared (by all cores) 8MiB (for a quad-core i7). Inclusive (of all the L2 and L1 per-core caches). 12 or 16 way associative. Latency = 34 cycles. Acts as a backstop for cache-coherency, so modified shared data doesn't have to go out to main memory and back.
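The Haswell figures above tie together through capacity = sets x ways x line size. A quick sanity check, assuming 64B lines and the 16-way L3 configuration:

```python
LINE_BYTES = 64
KIB = 1024
MIB = 1024 * KIB


def capacity_bytes(sets, ways, line_bytes=LINE_BYTES):
    """Total data capacity of a set-associative cache."""
    return sets * ways * line_bytes


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets implied by a given capacity and associativity."""
    return size_bytes // (ways * line_bytes)


l3_slice = capacity_bytes(2048, 16)   # one per-core slice of L3
l3_total = 4 * l3_slice               # quad-core part
print(l3_slice // MIB, "MiB per slice,", l3_total // MIB, "MiB total")
print("L1D sets:", num_sets(32 * KIB, 8))   # 8-way, 32kiB
print("L2 sets:", num_sets(256 * KIB, 8))   # 8-way, 256kiB
```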
Another real example: AMD Piledriver: (e.g. Opteron and desktop FX CPUs.) Cache-line size is still 64B, like Intel and AMD have used for several years now. Text mostly copied from Agner Fog's microarch pdf, with additional info from some slides I found, and more details on the write-through L1 + 4k write-combining cache on Agner's blog, with a comment that only L1 is WT, not L2.
L1I: 64 kB, 2-way, shared between a pair of cores (AMD's version of SMT has more static partitioning than Hyperthreading, and they call each one a core. Each pair shares a vector / FPU unit, and other pipeline resources.)
L1D: 16 kB, 4-way, per core. Latency = 3-4 c. (Notice that all 12 bits below the page offset are still used for index, so the usual VIPT trick works.) (throughput: two operations per clock, up to one of them being a store). Policy = Write-Through, with a 4k write-combining cache.
L2: 2 MB, 16-way, shared between two cores. Latency = 20 clocks. Read throughput 1 per 4 clock. Write throughput 1 per 12 clock.
L3: 0 - 8 MB, 64-way, shared between all cores. Latency = 87 clock. Read throughput 1 per 15 clock. Write throughput 1 per 21 clock
Agner Fog reports that with both cores of a pair active, L1 throughput is lower than when the other half of a pair is idle. It's not known what's going on, since the L1 caches are supposed to be separate for each core.
The other answers here give specific and technical reasons why L1 and L2 are sized as they are, and while many of them are motivating considerations for particular architectures, they aren't really necessary: the underlying architectural pressure leading to increasing (private) cache sizes as you move away from the core is fairly universal and is the same as the reasoning for multiple caches in the first place.
The three basic facts are:
1. The memory accesses for most applications exhibit a high degree of temporal locality, with a non-uniform distribution.
2. Across a large variety of processes and designs, cache size and cache speed (latency and throughput) can be traded off against each other [1].
3. Each distinct level of cache involves incremental design and performance cost.
So at a basic level, you might be able to, say, double the size of the cache, but incur a latency penalty of 1.4x compared to the smaller cache.
So it becomes an optimization problem: how many caches should you have and how large should they be? If memory access was totally uniform within the working set size, you'd probably end up with a single fairly large cache, or no cache at all. However, access is strongly non-uniform, so a small-and-fast cache can capture a large number of accesses, disproportionate to its size.
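To see how much non-uniformity matters, here is a toy model that assumes item popularity follows a Zipf-like distribution (weight 1/rank^skew) and that the cache ideally holds the hottest items. Both are simplifying assumptions for illustration, not a claim about any real workload.

```python
def fraction_captured(cache_items, working_set_items, skew=1.0):
    """Fraction of accesses that go to the `cache_items` hottest items.

    skew=0 gives uniform access; larger skew concentrates accesses on
    fewer items. Assumes the cache holds exactly the hottest items.
    """
    weights = [1.0 / rank ** skew for rank in range(1, working_set_items + 1)]
    return sum(weights[:cache_items]) / sum(weights)


WORKING_SET = 1_000_000
for cache_items in (1_000, 10_000, 100_000):
    uniform = fraction_captured(cache_items, WORKING_SET, skew=0.0)
    skewed = fraction_captured(cache_items, WORKING_SET, skew=1.0)
    print(f"cache holds {cache_items / WORKING_SET:.1%} of items: "
          f"uniform {uniform:.1%}, skewed {skewed:.1%} of accesses")
```

Under uniform access a cache covering 0.1% of the working set catches 0.1% of accesses; with the assumed skew the same cache catches roughly half of them.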
If fact 2 didn't exist, you'd just create a very big, very fast L1 cache within the other constraints of your chip and not need any other cache levels.
If fact 3 didn't exist, you'd end up with a huge number of fine-grained "caches", faster and small at the center, and slower and larger outside, or perhaps a single cache with variable access times: faster for the parts closest to the core. In practice, fact 3 means that each level of cache has an additional cost, so you usually end up with a few quantized levels of cache [2].
Other Constraints
This gives a basic framework to understand cache count and cache sizing decisions, but there are secondary factors at work as well. For example, Intel x86 has 4K page sizes and their L1 caches use a VIPT architecture. VIPT means that the size of the cache divided by the number of ways cannot be larger [3] than 4 KiB. So an 8-way L1 cache, as used on half a dozen Intel designs, can be at most 4 KiB * 8 = 32 KiB. It is probably no coincidence that that's exactly the size of the L1 cache on those designs! If it weren't for this constraint, it is entirely possible you'd have seen lower-associativity and/or larger L1 caches (e.g., 64 KiB, 4-way).
[1] Of course, there are other factors involved in the tradeoff as well, such as area and power, but holding those factors constant the size-speed tradeoff applies, and even if not held constant the basic behavior is the same.
[2] In addition to this pressure, there is a scheduling benefit to known-latency caches, like most L1 designs: an out-of-order scheduler can optimistically submit operations that depend on a memory load on the cycle that the L1 cache would return, reading the result off the bypass network. This reduces contention and perhaps shaves a cycle of latency off the critical path. This puts some pressure on the innermost cache level to have uniform/predictable latency and probably results in fewer cache levels.
[3] In principle, you can use VIPT caches without this restriction, but only by requiring OS support (e.g., page coloring) or with other constraints. The x86 arch hasn't done that and probably can't start now.
For those interested in this type of question, my university recommends Computer Architecture: A Quantitative Approach and Computer Organization and Design: The Hardware/Software Interface. Of course, if you don't have time for this, a quick overview is available on Wikipedia.
I think the main reason for this is that L1 cache is faster, and so it's more expensive.
https://en.wikichip.org/wiki/amd/microarchitectures/zen#Die
Compare the physical sizes of the L1, L2, and L3 caches for an AMD Zen core, for example. The density increases dramatically with the cache level.
Logically, the question answers itself.
If L1 were bigger than L2 (combined), then there would be no need for an L2 cache.
Why would you store your stuff on a tape drive if you could store all of it on an HDD?

Resources