Slowdown when accessing data at page boundaries? - performance

(My question is related to computer architecture and performance understanding. Did not find a relevant forum, so post it here as a general question.)
I have a C program which accesses memory words that are located X bytes apart in virtual address space. For instance, for (int i=0;<some stop condition>;i+=X){array[i]=4;}.
I measure the execution time with a varying value of X. Interestingly, when X is the power of 2 and is about page size, e.g., X=1024,2048,4096,8192..., I get to huge performance slowdown. But on all other values of X, like 1023 and 1025, there is no slowdown. The performance results are attached in the figure below.
I test my program on several personal machines, all are running Linux with x86_64 on Intel CPU.
What could be the cause of this slowdown? We have tried row buffer in DRAM, L3 cache, etc. which do not seem to make sense...
Update (July 11)
We did a little test here by adding NOP instructions to the original code. And the slowdown is still there. This sorta veto the 4k alias. The cause by conflict cache misses is more likely the case here.

There's 2 things here:
Set-associative cache aliasing creating conflict misses if you only touch the multiple-of-4096 addresses. Inner fast caches (L1 and L2) are normally indexed by a small range of bits from the physical address. So striding by 4096 bytes means those address bits are the same for all accesses so you're only one of the sets in L1d cache, and some small number in L2.
Striding by 1024 means you'd only be using 4 sets in L1d, with smaller powers of 2 using progressively more sets, but non-power-of-2 distributing over all the sets. (Intel CPUs have used 32KiB 8-way associative L1d caches for a long time; 32K/8 = 4K per way. Ice Lake bumped it up to 48K 12-way, so the same indexing where the set depends only on bits below the page number. This is not a coincidence for VIPT caches that want to index in parallel with TLB.)
But with a non-power-of-2 stride, your accesses will be distributed over more sets in the cache. Performance advantages of powers-of-2 sized data? (answer describes this disadvantage)
Which cache mapping technique is used in intel core i7 processor? - shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.
4k aliasing (e.g. in some Intel CPUs). Although with only stores this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out if a load might be reloading recently-stored data, and it does so in the first pass by just looking just at page-offset bits.
This is probably not what's going on for you, but for more details see:
L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes and
Why are elementwise additions much faster in separate loops than in a combined loop?
Either or both of these effects could be a factor in Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?
Another possible factor is that HW prefetching stops at physical page boundaries. Why does the speed of memcpy() drop dramatically every 4KB? But changing a stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not data from the next page.
I kind of assumed x86 for most of this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches. (Or on older CPUs, direct-mapped where each "set" only has 1 member). The 4k aliasing stuff might be mostly Intel-specific.
Prefetching across virtual page boundaries is likely also a general problem.

Related

Does anyone have an example where _mm256_stream_load_si256 (non-tempral load to bypasse cache) actually improves performance?

Consider massiveley SIMD-vectorized loops on very large amounts of floating point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming" i.e. bypassing cache) loads/store.
Using non-temp store (_mm256_stream_ps) actually does significantly improve throughput by about ~25% over plain store (_mm256_store_ps)
However, I could not measure any difference when using _mm256_stream_load instead of _mm256_load_ps.
Does anyone have an example where _mm256_stream_load_si256 can be used to actually improves performance ?
(Instruction set & Hardware is AVX2 on AMD Zen2, 64 cores)
for(size_t i=0; i < 1000000000/*larger than L3 cache-size*/; i+=8 )
{
#ifdef USE_STREAM_LOAD
__m256 a = _mm256_castsi256_ps (_mm256_stream_load_si256((__m256i *)source+i));
#else
__m256 a = _mm256_load_ps( source+i );
#endif
a *= a;
#ifdef USE_STREAM_STORE
_mm256_stream_ps (destination+i, a);
#else
_mm256_store_ps (destination+i, a);
#endif
}
stream_load (vmovntdqa) is just a slower version of normal load (extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).
The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.
Its purpose is for copying from video RAM back to main memory, as in an Intel whitepaper. It's useless unless you're copying from some kind of uncacheable device memory. (On current CPUs).
See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help if tuned carefully for your hardware and workload, is NT prefetch to reduce cache pollution. But tuning the prefetch distance is pretty brittle; too far and data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
There wouldn't be much if anything to gain in bandwidth anyway. Normal stores cost a read + an eventual write on eviction for each cache line. The Read For Ownership (RFO) is required for cache coherency, and because of how write-back caches work that only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those loads.
But plain loads aren't wasting anything, the only downside is evicting other data as you loop over huge arrays generating boatloads of cache misses, if you can't change your algorithm to have any locality.
If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth. e.g. do multiple steps over a subset of your data, then move on to the next.
See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.
Anything you can do to increase computational intensity helps (ALU work per time the data is loaded into L1d cache, or into registers).
Even better, make a custom loop that combines multiple steps that you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).
If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.
Sometimes it's hard, or you haven't gotten around to all the optimizations on your TODO list that you can think of, so NT stores can be good for memory bandwidth if nothing is going to re-read this data any time soon. But consider that a sign of failure, not success. CPUs have large fast caches, use them.
Further reading:
Enhanced REP MOVSB for memcpy - RFO vs. no-RFO stores (including NT stores), and how per-core memory bandwidth can be limited to the latency-bandwidth product given latency of handing off cache lines to lower levels and the number of LFBs to track them. Especially on Intel server chips.
Non-temporal loads and the hardware prefetcher, do they work together? - no, NT loads are only useful on WC memory, where HW prefetch doesn't work. They kind of exist to fill that gap.

Optimal buffer size to avoid cache misses for recent i7 / i9 CPUs

Let's assume an algorithm is repeatedly processing buffers of data, it may be accessing say 2 to 16 of these buffers, all having the same size. What would you expect to be the optimum size of these buffers, assuming the algorithm can process the full data in smaller blocks.
I expect the potential bottleneck of cache misses if the blocks are too big, but of course the bigger the blocks the better for vectorization.
Let's expect current i7/i9 CPUs (2018)
Any ideas?
Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
Modern Intel CPUs have 32k L1d, 256k L2 private per-core. (Or Skylake-AVX512 has 1MiB private L2 caches, with less shared L3). (Which cache mapping technique is used in intel core i7 processor?)
Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.
Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and spending more time outside of the main loop.
Also remember that with Hyperthreading, each logical core competes for cache on the physical core it's running on. So if two threads end up on the same physical core, but are touching totally different memory, your effective cache sizes are about half.
Probably you should make the buffer size a tunable parameter, and profile with a few different sizes.
Use perf counters to check if you're actually avoiding L1d or L2 misses or not, with different sizes, to help you understand whether your code is sensitive to different amounts of memory latency or not.

How multilevel CPU caches having the same cache line size work?

Note: I'm not sure if StackOverflow is the correct place for that question or if there is a more suitable StackExchange sub for this
I've read in a book, that for multilevel CPU caches, cache line size increases as per level's total memory size. I can totally undrestand how this works (or at least I think so) when used with quite simple architectures. Then I came accross this question. Question is how cache memories of the same cache line can cooperate?
This is how I percieve the way of cache memories with different cache line size work. For simplicity, lets suppose there are no different caches for data and for instructions and we only have L1 and L2 caches (L3 and L4 not exist).
If L1 has cache line size of 64 bytes and L2 of 128 bytes, when we have cache miss on L2 and we need to fetch the desired byte or word from main memory, we also bring its closest bytes or words in order to fill the 128 bytes of the L2 cache line. Then because of the locality of the references to memory locations produced by the processor we have higher probability of geting a hit on L2 whe missing on L1. But if we had equal cache line sizes this of course wouldn't happen, with the previous algorithm. Can you explain me some sort/simple algorithm or implementation of how modern CPUs take advantage of caches having the same line size?
Thanks in advance.
I've read in a book, that for multilevel CPU caches, cache line size increases as per level's total memory size.
That's not true for most CPUs. Usually the line size is the same in all caches, but the total size increases. Often also the associativity, but usually not by as much as the total size, so the number of sets typically increases.
The point of multi-level caches is to get low latency and large size without needing a single cache that's both large and low latency (because that's physically impossible).
HW prefetch into L2 and/or L1 is what makes sequential read work well, not larger line size in out levels of cache. (And in multi-core CPUs, private L1/L2 + shared L3 provide private latency + bandwidth filters for the memory workload hits the shared domain, but then you have L3 as a coherency backstop instead of hitting DRAM for data that's shared between cores.)
Having different line sizes in different caches is more complicated, especially in a multi-core system where caches have to maintain coherency with each other using MESI. Transferring around whole cache between caches works well.
But if if L1D lines are 64B and private L2 / shared L3 lines are 128B, then a load on one core might force the L2 cache to request both halves separately in case separate cores had each of the two halves of the 128B line modified. Sounds really complicated, and puts more logic into the outer-level cache.
(Paul Clayton's answer on the question you linked points out that a possible solution to that problem is separate validity bits for the two halves of a larger cache line, or even separate MESI coherency state. But still sharing the same tag, so if they are both valid then they have to be caching two halves of the same 128B block.)

hit ratio in cache - reading long sequence of bytes

Let assume that one row of cache has size 2^nB. Which hit ratio expected in the sequential reading byte by byte long contiguous memory?
To my eye it is (2^n - 1) / 2^n.
However, I am not sure if I am right. What do you think ?
yes, looks right for simple hardware (non-pipelined with no prefetching). e.g. 1 miss and 63 hits for 64B cache lines.
On real hardware, even in-order single-issue (non-superscalar) CPUs, miss under miss (multiple outstanding misses) is usually supported, so you will see misses until you run out of load buffers. This makes memory accesses pipelined as well, which is useful when misses to different cache lines can be in flight at once, instead of waiting the full latency for each one.
Real hardware will also have hardware prefetching. For example, have a look at Intel's article about disabling HW prefetching for some use-cases.
HW prefetching can probably keep up with a one-byte-at-a-time loop on most CPUs, so with good prefetching you might see hardly any L1 cache misses.
See Ulrich Drepper's What Every Programmer Should Know About Memory, and other links in the x86 tag wiki for more about real HW performance.

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors ?
L1 is very tightly coupled to the CPU core, and is accessed on every memory access (very frequent). Thus, it needs to return the data really fast (usually within on clock cycle). Latency and throughput (bandwidth) are both performance-critical for L1 data cache. (e.g. four cycle latency, and supporting two reads and one write by the CPU core every clock cycle). It needs lots of read/write ports to support this high access bandwidth. Building a large cache with these properties is impossible. Thus, designers keep it small, e.g. 32KB in most processors today.
L2 is accessed only on L1 misses, so accesses are less frequent (usually 1/20th of the L1). Thus, L2 can have higher latency (e.g. from 10 to 20 cycles) and have fewer ports. This allows designers to make it bigger.
L1 and L2 play very different roles. If L1 is made bigger, it will increase L1 access latency which will drastically reduce performance because it will make all dependent loads slower and harder for out-of-order execution to hide. L1 size is barely debatable.
If we removed L2, L1 misses will have to go to the next level, say memory. This means that a lot of access will be going to memory which would imply we need more memory bandwidth, which is already a bottleneck. Thus, keeping the L2 around is favorable.
Experts often refer to L1 as a latency filter (as it makes the common case of L1 hits faster) and L2 as a bandwidth filter as it reduces memory bandwidth usage.
Note: I have assumed a 2-level cache hierarchy in my argument to make it simpler. In many of today's multicore chips, there's an L3 cache shared between all the cores, while each core has its own private L1 and maybe L2. In these chips, the shared last-level cache (L3) plays the role of memory bandwidth filter. L2 plays the role of on-chip bandwidth filter, i.e. it reduces access to the on-chip interconnect and the L3. This allows designers to use a lower-bandwidth interconnect like a ring, and a slow single-port L3, which allows them to make L3 bigger.
Perhaps worth mentioning that the number of ports is a very important design point because it affects how much chip area the cache consumes. Ports add wires to the cache which consumes a lot of chip area and power.
There are different reasons for that.
L2 exists in the system to speedup the case where there is a L1 cache miss. If the size of L1 was the same or bigger than the size of L2, then L2 could not accomodate for more cache lines than L1, and would not be able to deal with L1 cache misses. From the design/cost perspective, L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to the slower hardware by adding intermediate hardware that is more performing (and expensive) than the slowest hardware and yet cheaper than the faster hardware you have. Even if you decided to double the L1 cache, you would also increment L2, to speedup L1-cache misses.
So why is there L2 cache at all? Well, L1 cache is usually more performant and expensive to build, and it is bound to a single core. This means that increasing the L1 size by a fixed quantity will have that cost multiplied by 4 in a dual core processor, or by 8 in a quad core. L2 is usually shared by different cores --depending on the architecture it can be shared across a couple or all cores in the processor, so the cost of increasing L2 would be smaller even if the price of L1 and L2 were the same --which it is not.
#Aater's answer explains some of the basics. I'll add some more details + an examples of the real cache organization on Intel Haswell and AMD Piledriver, with latencies and other properties, not just size.
For some details on IvyBridge, see my answer on "How can cache be that fast?", with some discussion of the overall load-use latency including address-calculation time, and widths of the data busses between different levels of cache.
L1 needs to be very fast (latency and throughput), even if that means a limited hit-rate. L1d also needs to support single-byte stores on almost all architectures, and (in some designs) unaligned accesses. This makes it hard to use ECC (error correction codes) to protect the data, and in fact some L1d designs (Intel) just use parity, with better ECC only in outer levels of cache (L2/L3) where the ECC can be done on larger chunks for lower overhead.
It's impossible to design a single level of cache that could provide the low average request latency (averaged over all hits and misses) of a modern multi-level cache. Since modern systems have multiple very hungry cores all sharing a connection to the same relatively-high latency DRAM, this is essential.
Every core needs its own private L1 for speed, but at least the last level of cache is typically shared, so a multi-threaded program that reads the same data from multiple threads doesn't have to go to DRAM for it on each core. (And to act as a backstop for data written by one core and read by another). This requires at least two levels of cache for a sane multi-core system, and is part of the motivation for more than 2 levels in current designs. Modern multi-core x86 CPUs have a fast 2-level cache in each core, and a larger slower cache shared by all cores.
L1 hit-rate is still very important, so L1 caches are not as small / simple / fast as they could be, because that would reduce hit rates. Achieving the same overall performance would thus require higher levels of cache to be faster. If higher levels handle more traffic, their latency is a bigger component of the average latency, and they bottleneck on their throughput more often (or need higher throughput).
High throughput often means being able to handle multiple reads and writes every cycle, i.e. multiple ports. This takes more area and power for the same capacity as a lower-throughput cache, so that's another reason for L1 to stay small.
L1 also uses speed tricks that wouldn't work if it was larger. i.e. most designs use Virtually-Indexed, Physically Tagged (VIPT) L1, but with all the index bits coming from below the page offset so they behave like PIPT (because the low bits of a virtual address are the same as in the physical address). This avoids synonyms / homonyms (false hits or the same data being in the cache twice, and see Paul Clayton's detailed answer on the linked question), but still lets part of the hit/miss check happen in parallel with the TLB lookup. A VIVT cache doesn't have to wait for the TLB, but it has to be invalidated on every change to the page tables.
On x86 (which uses 4kiB virtual memory pages), 32kiB 8-way associative L1 caches are common in modern designs. The 8 tags can be fetched based on the low 12 bits of the virtual address, because those bits are the same in virtual and physical addresses (they're below the page offset for 4kiB pages). This speed-hack for L1 caches only works if they're small enough and associative enough that the index doesn't depend on the TLB result. 32kiB / 64B lines / 8-way associativity = 64 (2^6) sets. So the lowest 6 bits of an address select bytes within a line, and the next 6 bits index a set of 8 tags. This set of 8 tags is fetched in parallel with the TLB lookup, so the tags can be checked in parallel against the physical-page selection bits of the TLB result to determine which (if any) of the 8 ways of the cache hold the data. (Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical)
Making a larger L1 cache would mean it had to either wait for the TLB result before it could even start fetching tags and loading them into the parallel comparators, or it would have to increase in associativity to keep log2(sets) + log2(line_size) <= 12. (More associativity means more ways per set => fewer total sets = fewer index bits). So e.g. a 64kiB cache would need to be 16-way associative: still 64 sets, but each set has twice as many ways. This makes increasing L1 size beyond the current size prohibitively expensive in terms of power, and probably even latency.
Spending more of your power budget on L1D cache logic would leave less power available for out-of-order execution, decoding, and of course L2 cache and so on. Getting the whole core to run at 4GHz and sustain ~4 instructions per clock (on high-ILP code) without melting requires a balanced design. See this article: Modern Microprocessors: A 90-Minute Guide!.
The larger a cache is, the more you lose by flushing it, so a large VIVT L1 cache would be worse than the current VIPT-that-works-like-PIPT. And a larger but higher-latency L1D would probably also be worse.
According to #PaulClayton, L1 caches often fetch all the data in a set in parallel with the tags, so it's there ready to be selected once the right tag is detected. The power cost of doing this scales with associativity, so a large highly-associative L1 would be really bad for power-use as well as die-area (and latency). (Compared to L2 and L3, it wouldn't be a lot of area, but physical proximity is important for latency. Speed-of-light propagation delays matter when clock cycles are 1/4 of a nanosecond.)
Slower caches (like L3) can run at a lower voltage / clock speed to make less heat. They can even use different arrangements of transistors for each storage cell, to make memory that's more optimized for power than for high speed.
There are a lot of power-use related reasons for multi-level caches. Power / heat is one of the most important constraints in modern CPU design, because cooling a tiny chip is hard. Everything is a tradeoff between speed and power (and/or die area). Also, many CPUs are powered by batteries or are in data-centres that need extra cooling.
L1 is almost always split into separate instruction and data caches. Instead of an extra read port in a unified L1 to support code-fetch, we can have a separate L1I cache tied to a separate I-TLB. (Modern CPUs often have an L2-TLB, which is a second level of cache for translations that's shared by the L1 I-TLB and D-TLB, NOT a TLB used by the regular L2 cache). This gives us 64kiB total of L1 cache, statically partitioned into code and data caches, for much cheaper (and probably lower latency) than a monster 64k L1 unified cache with the same total throughput. Since there is usually very little overlap between code and data, this is a big win.
L1I can be placed physically close to the code-fetch logic, while L1D can be physically close to the load/store units. Speed-of-light transmission-line delays are a big deal when a clock cycle lasts only 1/3rd of a nanosecond. Routing the wiring is also a big deal: e.g. Intel Broadwell has 13 layers of copper above the silicon.
Split L1 helps a lot with speed, but unified L2 is the best choice.
Some workloads have very small code but touch lots of data. It makes sense for higher-level caches to be unified to adapt to different workloads, instead of statically partitioning into code vs. data. (e.g. almost all of L2 will be caching data, not code, while running a big matrix multiply, vs. having a lot of code hot while running a bloated C++ program, or even an efficient implementation of a complicated algorithm (e.g. running gcc)). Code can be copied around as data, not always just loaded from disk into memory with DMA.
Caches also need logic to track outstanding misses (since out-of-order execution means that new requests can keep being generated before the first miss is resolved). Having many misses outstanding means you overlap the latency of the misses, achieving higher throughput. Duplicating the logic and/or statically partitioning between code and data in L2 would not be good.
Larger lower-traffic caches are also a good place to put pre-fetching logic. Hardware pre-fetching enables good performance for things like looping over an array without every piece of code needing software-prefetch instructions. (SW prefetch was important for a while, but HW prefetchers are smarter than they used to be, so that advice in Ulrich Drepper's otherwise excellent What Every Programmer Should Know About Memory is out-of-date for many use cases.)
Low-traffic higher level caches can afford the latency to do clever things like use an adaptive replacement policy instead of the usual LRU. Intel IvyBridge and later CPUs do this, to resist access patterns that get no cache hits for a working set just slightly too large to fit in cache. (e.g. looping over some data in the same direction twice means it probably gets evicted just before it would be reused.)
A real example: Intel Haswell. Sources: David Kanter's microarchitecture analysis and Agner Fog's testing results (microarch pdf). See also Intel's optimization manuals (links in the x86 tag wiki).
Also, I wrote up a separate answer on: Which cache mapping technique is used in intel core i7 processor?
Modern Intel designs use a large inclusive L3 cache shared by all cores as a backstop for cache-coherence traffic. It's physically distributed between the cores, with 2048 sets * 16-way (2MiB) per core (with an adaptive replacement policy in IvyBridge and later).
The lower levels of cache are per-core.
L1: per-core 32kiB each instruction and data (split), 8-way associative. Latency = 4 cycles. At least 2 read ports + 1 write port. (Maybe even more ports to handle traffic between L1 and L2, or maybe receiving a cache line from L2 conflicts with retiring a store.) Can track 10 outstanding cache misses (10 fill buffers).
L2: unified per-core 256kiB, 8-way associative. Latency = 11 or 12 cycles. Read bandwidth: 64 bytes / cycle. The main prefetching logic prefetches into L2. Can track 16 outstanding misses. Can supply 64B per cycle to the L1I or L1D. Actual port counts unknown.
L3: unified, shared (by all cores) 8MiB (for a quad-core i7). Inclusive (of all the L2 and L1 per-core caches). 12 or 16 way associative. Latency = 34 cycles. Acts as a backstop for cache-coherency, so modified shared data doesn't have to go out to main memory and back.
Another real example: AMD Piledriver: (e.g. Opteron and desktop FX CPUs.) Cache-line size is still 64B, like Intel and AMD have used for several years now. Text mostly copied from Agner Fog's microarch pdf, with additional info from some slides I found, and more details on the write-through L1 + 4k write-combining cache on Agner's blog, with a comment that only L1 is WT, not L2.
L1I: 64 kB, 2-way, shared between a pair of cores (AMD's version of SMD has more static partitioning than Hyperthreading, and they call each one a core. Each pair shares a vector / FPU unit, and other pipeline resources.)
L1D: 16 kB, 4-way, per core. Latency = 3-4 c. (Notice that all 12 bits below the page offset are still used for index, so the usual VIPT trick works.) (throughput: two operations per clock, up to one of them being a store). Policy = Write-Through, with a 4k write-combining cache.
L2: 2 MB, 16-way, shared between two cores. Latency = 20 clocks. Read throughput 1 per 4 clock. Write throughput 1 per 12 clock.
L3: 0 - 8 MB, 64-way, shared between all cores. Latency = 87 clock. Read throughput 1 per 15 clock. Write throughput 1 per 21 clock
Agner Fog reports that with both cores of a pair active, L1 throughput is lower than when the other half of a pair is idle. It's not known what's going on, since the L1 caches are supposed to be separate for each core.
The other answers here give specific and technical reasons why L1 and L2 are sized as they are, and while many of them are motivating considerations for particular architectures, they aren't really necessary: the underlying architectural pressure leading to increasing (private) cache sizes as you move away from the core is fairly universal and is the same as the reasoning for multiple caches in the first place.
The three basic facts are:
The memory accesses for most applications exhibit a high degree of temporal locality, with a non-uniform distribution.
Across a large variety of process and designs, cache size and cache speed (latency and throughput) can be traded off against each other1.
Each distinct level of cache involves incremental design and performance cost.
So at a basic level, you might be able to say double the size of the cache, but incur a latency penalty of 1.4 compared to the smaller cache.
So it becomes an optimization problem: how many caches should you have and how large should they be? If memory access was totally uniform within the working set size, you'd probably end up with a single fairly large cache, or no cache at all. However, access is strongly non-uniform, so a small-and-fast cache can capture a large number of accesses, disproportionate to it's size.
If fact 2 didn't exist, you'd just create a very big, very fast L1 cache within the other constraints of your chip and not need any other cache levels.
If fact 3 didn't exist, you'd end up with a huge number of fine-grained "caches", faster and small at the center, and slower and larger outside, or perhaps a single cache with variable access times: faster for the parts closest to the core. In practice, rule 3 means that each level of cache has an additional cost, so you usually end up with a few quantized levels of cache2.
Other Constraints
This gives a basic framework to understand cache count and cache sizing decisions, but there are secondary factors at work as well. For example, Intel x86 has 4K page sizes and their L1 caches use a VIPT architecture. VIPT means that the size of the cache divided by the number of ways cannot be larger3 than 4 KiB. So an 8-way L1 cache as used on the half dozen Intel designs can be at most 4 KiB * 8 = 32 KiB. It is probably no coincidence that that's exactly the size of the L1 cache on those designs! If it weren't for this constraint, it is entirely possible you'd have seen lower-associativity and/or larger L1 caches (e.g., 64 KiB, 4-way).
1 Of course, there are other factors involved in the tradeoff as well, such as area and power, but holding those factors constant the size-speed tradeoff applies, and even if not held constant the basic behavior is the same.
2 In addition to this pressure, there is a scheduling benefit to known-latency caches, like most L1 designs: and out-of-order scheduler can optimistically submit operations that depend on a memory load on the cycle that the L1 cache would return, reading the result off the bypass network. This reduces contention and perhaps shaves a cycle of latency off the critical path. This puts some pressure on the innermost cache level to have uniform/predictable latency and probably results in fewer cache levels.
3 In principle, you can use VIPT caches without this restriction, but only by requiring OS support (e.g., page coloring) or with other constraints. The x86 arch hasn't done that and probably can't start now.
For those interested in this type of questions, my university recommends Computer Architecture: A Quantitative Approach and Computer Organization and Design: The Hardware/Software Interface. Of course, if you don't have time for this, a quick overview is available on Wikipedia.
I think the main reason for this is, that L1-Cache is faster and so it's more expensive.
https://en.wikichip.org/wiki/amd/microarchitectures/zen#Die
Compare the size of the L1, L2, and L3 caches physical size for an AMD Zen core, for example. The density increases dramatically with the cache level.
logically, the question answers itself.
If L1 were bigger than L2 (combined), then there would be no need of L2 Cache.
Why would you store your stuff on tape-drive if you can store all of it on HDD ?

Resources