This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores # 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put it shortly, all parallel workload does not infinitely scale. When many cores compete for the same shared resource (eg. DRAM), using too many cores is often detrimental because there is a point where there are enough cores to saturate a given shared resource and using more core only increase the overheads.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetch should improve a bit the performances and the scalability of your benchmark. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with an UPI interconnect and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controller (IMC) per processor and each is connected to 3 DDR4 DRAM # 2666MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor access only to its NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters. Linux perf or VTune enable you to check this relatively easily.
The L3 cache is split in slices. All physical addresses are distributed across the cache slices using an hash function (see here for more informations). This method enable the processor to balance the throughput between all the L3 slices. This method also enable the processor to balance the throughput between the two IMCs so that in-fine the processor looks like a SMP architecture instead of a NUMA one. This was also use in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee a perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. A bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of core used. If the hash function is not good enough, one IMC can be more saturated than the other one oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for the each IMC throughput but they have a limited granularity which is quite big. On my Skylake machine the name of the hardware counters are uncore_imc/data_reads/ and uncore_imc/data_writes/ but on your platform you certainly have 4 counters for that (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like your. The idea is to split the processor in two NUMA nodes that have their own dedicated IMC. This solve the balancing issue due to the hash function and so result in faster memory operations as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines which need to be fetched again later when the core actual need them (with an additional DRAM latency time to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAMs, hardware prefetching units have to prefetch data a long time in advance so to reduce the impact of the latency. They also need to perform a lot of requests concurrently. This is generally not a problem with sequential accesses, but more cores causes accesses to look more random from the caches and IMCs point-of-view. The thing is DRAM are designed so that contiguous accesses are faster than random one (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check if more data are re-fetched with more threads (I see such effect on my Skylake-based PC with only 6-cores but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetch (prefetchnta) to request the processor to load data directly into the line fill buffer instead of the L3 cache resulting in a lower pollution (here is a related answer). This may be slower with fewer cores due to a lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of having fetched address that looks more random from the IMCs point-of-view and there is not much to do about that.
The low-level architecture DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
IntelĀ® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)
Given a cache size with constant capacity and associativity, for a given code to determine average of array elements, would a cache with higher block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for(j=0; j < k; j++) {
sub_total = 0; /* Nested loops to avoid overflow */
for(i=0; i < N; i++) {
sub_total += A[jN + i];
}
total += sub_total/N;
}
average = total/k;
Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (powerpoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size showing a curve, and also breaking it down into miss penalty vs. miss rate (for a simple model I think, for a simple in-order CPU that sucks at hiding memory latency. e.g. maybe not even pipelining multiple independent misses. (miss under miss))
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order), and even cache-blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not like modern out-of-order CPUs that can overlap some computation with cache-miss latency so more shorter misses is different from fewer longer misses.
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
If your cache-miss handling has an early restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says theoretically possible to not have it in a CPU design). The load that caused the miss gets its data as soon as it arrives. The rest of the cache line fill happens "in the background", and hopefully later loads can also be satisfied from the partially-received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
The simplistic answer is that larger cache blocks would be preferred since the workload has no (data) temporal locality (no data reuse), perfect spacial locality (excluding the potentially inadequate alignment of the array for the first block and insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider the size and alignment of the array (the fraction of the first and last cache blocks that will be unused and what fraction of the memory transfer time that represents; for a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth), the ability of the system to use critical word first (if the array is of modest size and there is no support for early use of data as it becomes available rather than waiting for the entire block to be filled, then the start-up overhead will remove much of the prefetching advantage of larger cache blocks), the use of prefetching (software or hardware prefetching reduces the benefit of large cache blocks and this workload is extremely friendly to prefetching), the configuration of the memory system (e.g., using DRAM with an immediate page close controller policy would increase the benefit of larger cache blocks because each access would involve a row activate and row close, often to the same DRAM bank preventing latency overlap), whether the same block size is used for instructions and page table accesses and whether these accesses share the cache (instruction accesses provide a second "stream" which can introduce conflict misses; with shared caching of a two-level hierarchical page table TLB misses would access two cache blocks), whether simple way prediction is used (a larger block would increase prediction accuracy reducing misprediction overhead), and perhaps other factors.
From your example code we can't say either way as long as the hardware pre-fetcher can keep up a memory stream at maximum memory throughput.
In a random access scenario a shorter cache-line might be preferable as you then don't need to fill all the line. But the total amount of cached memory would go down as you need more circuits for tags and potentially more time for comparing.
So a compromise must be made Intel has chosen 64-bytes per line (and fetches 2 lines) others has chosen 32-bytes per line.
I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).
For example, let's consider:
_mm_cmpestrc
This has a latency of 11, and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get a continuous per cycle-output after 11 cycles? Since this would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?
I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.
For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel's optimization manual.
See also
https://uops.info/ for accurate tables collected programmatically from microbenchmarks, so they're free from editing errors like Agner's tables sometimes have.
How many CPU cycles are needed for each assembly instruction?
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details about using instruction-cost numbers.
What is the efficient way to count set bits at a position or lower? For an example of analyzing short sequences of asm in terms of front-end uops, back-end ports, and latency.
Modern Microprocessors: A 90-Minute Guide! very good intro to the basics of CPU pipelines and HW design constraints like power.
Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.
You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).
Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem).
Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, this can be the bottleneck when data is hot in L1 cache.
And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result. (i.e. destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
I think I have a decent understanding of the difference between latency and throughput
Your guesses don't seem to make sense, so you're definitely missing something.
CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock)
(reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.
Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.
If the next iteration of a loop operates independently from the previous, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.
See also Latency bounds and throughput bounds for processors for operations that must occur in sequence for an example of a practice problem from CS:APP with a diagram of two dep chains, one also depending on results from the other.
Let assume that one row of cache has size 2^nB. Which hit ratio expected in the sequential reading byte by byte long contiguous memory?
To my eye it is (2^n - 1) / 2^n.
However, I am not sure if I am right. What do you think ?
yes, looks right for simple hardware (non-pipelined with no prefetching). e.g. 1 miss and 63 hits for 64B cache lines.
On real hardware, even in-order single-issue (non-superscalar) CPUs, miss under miss (multiple outstanding misses) is usually supported, so you will see misses until you run out of load buffers. This makes memory accesses pipelined as well, which is useful when misses to different cache lines can be in flight at once, instead of waiting the full latency for each one.
Real hardware will also have hardware prefetching. For example, have a look at Intel's article about disabling HW prefetching for some use-cases.
HW prefetching can probably keep up with a one-byte-at-a-time loop on most CPUs, so with good prefetching you might see hardly any L1 cache misses.
See Ulrich Drepper's What Every Programmer Should Know About Memory, and other links in the x86 tag wiki for more about real HW performance.
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors ?
L1 is very tightly coupled to the CPU core, and is accessed on every memory access (very frequent). Thus, it needs to return the data really fast (usually within on clock cycle). Latency and throughput (bandwidth) are both performance-critical for L1 data cache. (e.g. four cycle latency, and supporting two reads and one write by the CPU core every clock cycle). It needs lots of read/write ports to support this high access bandwidth. Building a large cache with these properties is impossible. Thus, designers keep it small, e.g. 32KB in most processors today.
L2 is accessed only on L1 misses, so accesses are less frequent (usually 1/20th of the L1). Thus, L2 can have higher latency (e.g. from 10 to 20 cycles) and have fewer ports. This allows designers to make it bigger.
L1 and L2 play very different roles. If L1 is made bigger, it will increase L1 access latency which will drastically reduce performance because it will make all dependent loads slower and harder for out-of-order execution to hide. L1 size is barely debatable.
If we removed L2, L1 misses will have to go to the next level, say memory. This means that a lot of access will be going to memory which would imply we need more memory bandwidth, which is already a bottleneck. Thus, keeping the L2 around is favorable.
Experts often refer to L1 as a latency filter (as it makes the common case of L1 hits faster) and L2 as a bandwidth filter as it reduces memory bandwidth usage.
Note: I have assumed a 2-level cache hierarchy in my argument to make it simpler. In many of today's multicore chips, there's an L3 cache shared between all the cores, while each core has its own private L1 and maybe L2. In these chips, the shared last-level cache (L3) plays the role of memory bandwidth filter. L2 plays the role of on-chip bandwidth filter, i.e. it reduces access to the on-chip interconnect and the L3. This allows designers to use a lower-bandwidth interconnect like a ring, and a slow single-port L3, which allows them to make L3 bigger.
Perhaps worth mentioning that the number of ports is a very important design point because it affects how much chip area the cache consumes. Ports add wires to the cache which consumes a lot of chip area and power.
There are different reasons for that.
L2 exists in the system to speedup the case where there is a L1 cache miss. If the size of L1 was the same or bigger than the size of L2, then L2 could not accomodate for more cache lines than L1, and would not be able to deal with L1 cache misses. From the design/cost perspective, L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to the slower hardware by adding intermediate hardware that is more performing (and expensive) than the slowest hardware and yet cheaper than the faster hardware you have. Even if you decided to double the L1 cache, you would also increment L2, to speedup L1-cache misses.
So why is there L2 cache at all? Well, L1 cache is usually more performant and expensive to build, and it is bound to a single core. This means that increasing the L1 size by a fixed quantity will have that cost multiplied by 4 in a dual core processor, or by 8 in a quad core. L2 is usually shared by different cores --depending on the architecture it can be shared across a couple or all cores in the processor, so the cost of increasing L2 would be smaller even if the price of L1 and L2 were the same --which it is not.
#Aater's answer explains some of the basics. I'll add some more details + an examples of the real cache organization on Intel Haswell and AMD Piledriver, with latencies and other properties, not just size.
For some details on IvyBridge, see my answer on "How can cache be that fast?", with some discussion of the overall load-use latency including address-calculation time, and widths of the data busses between different levels of cache.
L1 needs to be very fast (latency and throughput), even if that means a limited hit-rate. L1d also needs to support single-byte stores on almost all architectures, and (in some designs) unaligned accesses. This makes it hard to use ECC (error correction codes) to protect the data, and in fact some L1d designs (Intel) just use parity, with better ECC only in outer levels of cache (L2/L3) where the ECC can be done on larger chunks for lower overhead.
It's impossible to design a single level of cache that could provide the low average request latency (averaged over all hits and misses) of a modern multi-level cache. Since modern systems have multiple very hungry cores all sharing a connection to the same relatively-high latency DRAM, this is essential.
Every core needs its own private L1 for speed, but at least the last level of cache is typically shared, so a multi-threaded program that reads the same data from multiple threads doesn't have to go to DRAM for it on each core. (And to act as a backstop for data written by one core and read by another). This requires at least two levels of cache for a sane multi-core system, and is part of the motivation for more than 2 levels in current designs. Modern multi-core x86 CPUs have a fast 2-level cache in each core, and a larger slower cache shared by all cores.
L1 hit-rate is still very important, so L1 caches are not as small / simple / fast as they could be, because that would reduce hit rates. Achieving the same overall performance would thus require higher levels of cache to be faster. If higher levels handle more traffic, their latency is a bigger component of the average latency, and they bottleneck on their throughput more often (or need higher throughput).
High throughput often means being able to handle multiple reads and writes every cycle, i.e. multiple ports. This takes more area and power for the same capacity as a lower-throughput cache, so that's another reason for L1 to stay small.
L1 also uses speed tricks that wouldn't work if it was larger. i.e. most designs use Virtually-Indexed, Physically Tagged (VIPT) L1, but with all the index bits coming from below the page offset so they behave like PIPT (because the low bits of a virtual address are the same as in the physical address). This avoids synonyms / homonyms (false hits or the same data being in the cache twice, and see Paul Clayton's detailed answer on the linked question), but still lets part of the hit/miss check happen in parallel with the TLB lookup. A VIVT cache doesn't have to wait for the TLB, but it has to be invalidated on every change to the page tables.
On x86 (which uses 4kiB virtual memory pages), 32kiB 8-way associative L1 caches are common in modern designs. The 8 tags can be fetched based on the low 12 bits of the virtual address, because those bits are the same in virtual and physical addresses (they're below the page offset for 4kiB pages). This speed-hack for L1 caches only works if they're small enough and associative enough that the index doesn't depend on the TLB result. 32kiB / 64B lines / 8-way associativity = 64 (2^6) sets. So the lowest 6 bits of an address select bytes within a line, and the next 6 bits index a set of 8 tags. This set of 8 tags is fetched in parallel with the TLB lookup, so the tags can be checked in parallel against the physical-page selection bits of the TLB result to determine which (if any) of the 8 ways of the cache hold the data. (Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical)
Making a larger L1 cache would mean it had to either wait for the TLB result before it could even start fetching tags and loading them into the parallel comparators, or it would have to increase in associativity to keep log2(sets) + log2(line_size) <= 12. (More associativity means more ways per set => fewer total sets = fewer index bits). So e.g. a 64kiB cache would need to be 16-way associative: still 64 sets, but each set has twice as many ways. This makes increasing L1 size beyond the current size prohibitively expensive in terms of power, and probably even latency.
Spending more of your power budget on L1D cache logic would leave less power available for out-of-order execution, decoding, and of course L2 cache and so on. Getting the whole core to run at 4GHz and sustain ~4 instructions per clock (on high-ILP code) without melting requires a balanced design. See this article: Modern Microprocessors: A 90-Minute Guide!.
The larger a cache is, the more you lose by flushing it, so a large VIVT L1 cache would be worse than the current VIPT-that-works-like-PIPT. And a larger but higher-latency L1D would probably also be worse.
According to #PaulClayton, L1 caches often fetch all the data in a set in parallel with the tags, so it's there ready to be selected once the right tag is detected. The power cost of doing this scales with associativity, so a large highly-associative L1 would be really bad for power-use as well as die-area (and latency). (Compared to L2 and L3, it wouldn't be a lot of area, but physical proximity is important for latency. Speed-of-light propagation delays matter when clock cycles are 1/4 of a nanosecond.)
Slower caches (like L3) can run at a lower voltage / clock speed to make less heat. They can even use different arrangements of transistors for each storage cell, to make memory that's more optimized for power than for high speed.
There are a lot of power-use related reasons for multi-level caches. Power / heat is one of the most important constraints in modern CPU design, because cooling a tiny chip is hard. Everything is a tradeoff between speed and power (and/or die area). Also, many CPUs are powered by batteries or are in data-centres that need extra cooling.
L1 is almost always split into separate instruction and data caches. Instead of an extra read port in a unified L1 to support code-fetch, we can have a separate L1I cache tied to a separate I-TLB. (Modern CPUs often have an L2-TLB, which is a second level of cache for translations that's shared by the L1 I-TLB and D-TLB, NOT a TLB used by the regular L2 cache). This gives us 64kiB total of L1 cache, statically partitioned into code and data caches, for much cheaper (and probably lower latency) than a monster 64k L1 unified cache with the same total throughput. Since there is usually very little overlap between code and data, this is a big win.
L1I can be placed physically close to the code-fetch logic, while L1D can be physically close to the load/store units. Speed-of-light transmission-line delays are a big deal when a clock cycle lasts only 1/3rd of a nanosecond. Routing the wiring is also a big deal: e.g. Intel Broadwell has 13 layers of copper above the silicon.
Split L1 helps a lot with speed, but unified L2 is the best choice.
Some workloads have very small code but touch lots of data. It makes sense for higher-level caches to be unified to adapt to different workloads, instead of statically partitioning into code vs. data. (e.g. almost all of L2 will be caching data, not code, while running a big matrix multiply, vs. having a lot of code hot while running a bloated C++ program, or even an efficient implementation of a complicated algorithm (e.g. running gcc)). Code can be copied around as data, not always just loaded from disk into memory with DMA.
Caches also need logic to track outstanding misses (since out-of-order execution means that new requests can keep being generated before the first miss is resolved). Having many misses outstanding means you overlap the latency of the misses, achieving higher throughput. Duplicating the logic and/or statically partitioning between code and data in L2 would not be good.
Larger lower-traffic caches are also a good place to put pre-fetching logic. Hardware pre-fetching enables good performance for things like looping over an array without every piece of code needing software-prefetch instructions. (SW prefetch was important for a while, but HW prefetchers are smarter than they used to be, so that advice in Ulrich Drepper's otherwise excellent What Every Programmer Should Know About Memory is out-of-date for many use cases.)
Low-traffic higher level caches can afford the latency to do clever things like use an adaptive replacement policy instead of the usual LRU. Intel IvyBridge and later CPUs do this, to resist access patterns that get no cache hits for a working set just slightly too large to fit in cache. (e.g. looping over some data in the same direction twice means it probably gets evicted just before it would be reused.)
A real example: Intel Haswell. Sources: David Kanter's microarchitecture analysis and Agner Fog's testing results (microarch pdf). See also Intel's optimization manuals (links in the x86 tag wiki).
Also, I wrote up a separate answer on: Which cache mapping technique is used in intel core i7 processor?
Modern Intel designs use a large inclusive L3 cache shared by all cores as a backstop for cache-coherence traffic. It's physically distributed between the cores, with 2048 sets * 16-way (2MiB) per core (with an adaptive replacement policy in IvyBridge and later).
The lower levels of cache are per-core.
L1: per-core 32kiB each instruction and data (split), 8-way associative. Latency = 4 cycles. At least 2 read ports + 1 write port. (Maybe even more ports to handle traffic between L1 and L2, or maybe receiving a cache line from L2 conflicts with retiring a store.) Can track 10 outstanding cache misses (10 fill buffers).
L2: unified per-core 256kiB, 8-way associative. Latency = 11 or 12 cycles. Read bandwidth: 64 bytes / cycle. The main prefetching logic prefetches into L2. Can track 16 outstanding misses. Can supply 64B per cycle to the L1I or L1D. Actual port counts unknown.
L3: unified, shared (by all cores) 8MiB (for a quad-core i7). Inclusive (of all the L2 and L1 per-core caches). 12 or 16 way associative. Latency = 34 cycles. Acts as a backstop for cache-coherency, so modified shared data doesn't have to go out to main memory and back.
Another real example: AMD Piledriver: (e.g. Opteron and desktop FX CPUs.) Cache-line size is still 64B, like Intel and AMD have used for several years now. Text mostly copied from Agner Fog's microarch pdf, with additional info from some slides I found, and more details on the write-through L1 + 4k write-combining cache on Agner's blog, with a comment that only L1 is WT, not L2.
L1I: 64 kB, 2-way, shared between a pair of cores (AMD's version of SMD has more static partitioning than Hyperthreading, and they call each one a core. Each pair shares a vector / FPU unit, and other pipeline resources.)
L1D: 16 kB, 4-way, per core. Latency = 3-4 c. (Notice that all 12 bits below the page offset are still used for index, so the usual VIPT trick works.) (throughput: two operations per clock, up to one of them being a store). Policy = Write-Through, with a 4k write-combining cache.
L2: 2 MB, 16-way, shared between two cores. Latency = 20 clocks. Read throughput 1 per 4 clock. Write throughput 1 per 12 clock.
L3: 0 - 8 MB, 64-way, shared between all cores. Latency = 87 clock. Read throughput 1 per 15 clock. Write throughput 1 per 21 clock
Agner Fog reports that with both cores of a pair active, L1 throughput is lower than when the other half of a pair is idle. It's not known what's going on, since the L1 caches are supposed to be separate for each core.
The other answers here give specific and technical reasons why L1 and L2 are sized as they are, and while many of them are motivating considerations for particular architectures, they aren't really necessary: the underlying architectural pressure leading to increasing (private) cache sizes as you move away from the core is fairly universal and is the same as the reasoning for multiple caches in the first place.
The three basic facts are:
The memory accesses for most applications exhibit a high degree of temporal locality, with a non-uniform distribution.
Across a large variety of process and designs, cache size and cache speed (latency and throughput) can be traded off against each other1.
Each distinct level of cache involves incremental design and performance cost.
So at a basic level, you might be able to say double the size of the cache, but incur a latency penalty of 1.4 compared to the smaller cache.
So it becomes an optimization problem: how many caches should you have and how large should they be? If memory access was totally uniform within the working set size, you'd probably end up with a single fairly large cache, or no cache at all. However, access is strongly non-uniform, so a small-and-fast cache can capture a large number of accesses, disproportionate to it's size.
If fact 2 didn't exist, you'd just create a very big, very fast L1 cache within the other constraints of your chip and not need any other cache levels.
If fact 3 didn't exist, you'd end up with a huge number of fine-grained "caches", faster and small at the center, and slower and larger outside, or perhaps a single cache with variable access times: faster for the parts closest to the core. In practice, rule 3 means that each level of cache has an additional cost, so you usually end up with a few quantized levels of cache2.
Other Constraints
This gives a basic framework to understand cache count and cache sizing decisions, but there are secondary factors at work as well. For example, Intel x86 has 4K page sizes and their L1 caches use a VIPT architecture. VIPT means that the size of the cache divided by the number of ways cannot be larger3 than 4 KiB. So an 8-way L1 cache as used on the half dozen Intel designs can be at most 4 KiB * 8 = 32 KiB. It is probably no coincidence that that's exactly the size of the L1 cache on those designs! If it weren't for this constraint, it is entirely possible you'd have seen lower-associativity and/or larger L1 caches (e.g., 64 KiB, 4-way).
1 Of course, there are other factors involved in the tradeoff as well, such as area and power, but holding those factors constant the size-speed tradeoff applies, and even if not held constant the basic behavior is the same.
2 In addition to this pressure, there is a scheduling benefit to known-latency caches, like most L1 designs: and out-of-order scheduler can optimistically submit operations that depend on a memory load on the cycle that the L1 cache would return, reading the result off the bypass network. This reduces contention and perhaps shaves a cycle of latency off the critical path. This puts some pressure on the innermost cache level to have uniform/predictable latency and probably results in fewer cache levels.
3 In principle, you can use VIPT caches without this restriction, but only by requiring OS support (e.g., page coloring) or with other constraints. The x86 arch hasn't done that and probably can't start now.
For those interested in this type of questions, my university recommends Computer Architecture: A Quantitative Approach and Computer Organization and Design: The Hardware/Software Interface. Of course, if you don't have time for this, a quick overview is available on Wikipedia.
I think the main reason for this is, that L1-Cache is faster and so it's more expensive.
https://en.wikichip.org/wiki/amd/microarchitectures/zen#Die
Compare the size of the L1, L2, and L3 caches physical size for an AMD Zen core, for example. The density increases dramatically with the cache level.
logically, the question answers itself.
If L1 were bigger than L2 (combined), then there would be no need of L2 Cache.
Why would you store your stuff on tape-drive if you can store all of it on HDD ?