I know that l1 and l2 caches are levels in multi-level cache.
I would like to know where each level cache is placed, and what is the maximum number of cache levels allowed?
Both of these depend on the CPU. There are CPUs which have no cache at all, there are CPUs which have the L1 cache on die and the L2 cache on a separate die on the same chip or even on a separate chip, or there are CPUs which have both L1 and L2 cache on the same die as the CPU core.
There are multi-core, multi-chip CPUs where each core has its own L1 cache on die, the 4 cores of one multi-core chip share an L2 cache that is on chip, but on a separate die, and the 2 chips share an L3 cache that is on a separate chip, but in the same package. Sometimes, there are also so-called CPU books which contain multiple chip packages, which might or might not have their own shared cache, which would then be an L4 cache.
Of course, multi-core chips don't have to share their L2 cache, they can also have private L2 caches.
And it's not always obvious, what level a certain cache is, or even whether or not a piece of RAM is a cache at all.
For example, on later Intel 80486 processors, there was an L1 cache on the chip and an L2 cache on the motherboard. But then AMD came out with a socket-compatible CPU that had both an L1 and L2 cache on the chip. So, the exact same cache chip on the motherboard was either an L2 or L3 cache, depending on what kind of CPU you used.
On the Cell BE CPU, the SPEs have 256 KiByte of RAM each. Except that this RAM has about the same size and the same speed as a typical L2 cache, and since the SPEs don't have any other caches, you could also view this as a cache. However, caches are normally managed automatically by the CPU, whereas RAM is typically managed by the user program, the language runtime or the OS, not the CPU. So, is this RAM or a cache? It turns out that, in order to achieve best performance, you should really not view this as RAM, but more as a software-controlled cache.
The different between L1 and L2 cache
Although both L1 and L2 are cache memories they have their key differences. L1 and L2 are the first and second cache in the hierarchy of cache levels.
L1 has a smaller memory capacity than L2.
Also, L1 can be accessed faster than L2.
L2 is accessed only if the requested data in not found in L1.**
L1 is usually in-built to the chip, while L2 is soldered on the
motherboard very close to the chip.
Therefore, L1 has a very little delay compared to L2. Because L1 is
implemented using SRAM and L2 is implemented using DRAM, L1 does not
need refreshing, while L2 needs to be refreshed.
If the caches are strictly inclusive, all data in L1 can be found in
L2 as well. However, if the caches are exclusive, same data will not
be available in both L1 and L2.
IF YOU WANT TO READ DEEPLY CLICK THIS LINK
Taken from this link -
L1 and L2 are levels of cache memory in a computer. If the computer processor can find the data it needs for its next operation in cache memory, it will save time compared to having to get it from random access memory. L1 is "level-1" cache memory, usually built onto the microprocessor chip itself. For example, the Intel MMX microprocessor comes with 32 thousand bytes of L1.
L2 (that is, level-2) cache memory is on a separate chip (possibly on an expansion card) that can be accessed more quickly than the larger "main" memory. A popular L2 cache memory size is 1,024 kilobytes (one megabyte).
Complete Cache architecture is here in WIKI
Related
I don't know why L1 Cache and L2 Cache save the same data.
For example, let's say we want to access Memory[x] for the first time. Memory[x] is mapped to the L2 Cache first, then the same data piece is mapped to L1 Cache where CPU register can retrieve data from.
But we have duplicated data stored on both L1 and L2 cache, isn't it a problem or at least a waste of storage space?
I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking.
Not all caches are like that. The Cache Inclusion Policy for an outer cache can be Inclusive, Exclusive, or Not-Inclusive / Not-Exclusive.
NINE is the "normal" case, not maintaining either special property, but L2 does tend to have copies of most lines in L1 for the reason you describe in the question. If L2 is less associative than L1 (like in Skylake-client) and the access pattern creates a lot of conflict misses in L2 (unlikely), you could get a decent amount of data that's only in L1. And maybe in other ways, e.g. via hardware prefetch, or from L2 evictions of data due to code-fetch, because real CPUs use split L1i / L1d caches.
For the outer caches to be useful, you need some way for data to enter them so you can get an L2 hit sometime after the line was evicted from the smaller L1. Having inner caches like L1d fetch through outer caches gives you that for free, and has some advantages. You can put hardware prefetch logic in an outer or middle level of cache, which doesn't have to be as high-performance as L1. (e.g. Intel CPUs have most of their prefetch logic in the private per-core L2, but also some prefetch logic in L1d).
The other main option is for the outer cache to be a victim cache, i.e. lines enter it only when they're evicted from L1. So you can loop over an array of L1 + L2 size and probably still get L2 hits. The extra logic to implement this is useful if you want a relatively large L1 compared to L2, so the total size is more than a little larger than L2 alone.
With an exclusive L2, an L1 miss / L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set.
Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. AMD K10 / Barcelona). Both of those caches are private per-core caches, not shared, so it's like the simple L1 / L2 situation for a single core CPU you're talking about.
Things get more complicated with multi-core CPUs and shared caches!
Barcelona's shared L3 cache is also mostly exclusive of the inner caches, but not strictly. David Kanter explains:
First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines.
AMD's successor to K10/Barcelona is Bulldozer. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is also victim cache, and thus mostly exclusive of L2. It's probably like Barcelona's L3.
But Bulldozer's L1d is a small write-through cache with an even smaller (4k) write-combining buffer, so it's mostly inclusive of L2. Bulldozer's write-through L1d is generally considered a mistake in the CPU design world, and Ryzen went back to a normal 32kiB write-back L1d like Intel has been using all along (with great results). A pair of weak integer cores form a "cluster" that shares an FPU/SIMD unit, and shares a big L2 that's "mostly inclusive". (i.e. probably a standard NINE). This cluster thing is Bulldozer's alternative to SMT / Hyperthreading, which AMD also ditched for Ryzen in favour of normal SMT with a massively wide out-of-order core.
Ryzen also has some exclusivity between core clusters (CCX), apparently, but I haven't looked into the details.
I've been talking about AMD first because they have used exclusive caches in recent designs, and seem to have a preference for victim caches. Intel hasn't tried as many different things, because they hit on a good design with Nehalem and stuck with it until Skylake-AVX512.
Intel Nehalem and later use a large shared tag-inclusive L3 cache. For lines that are modified / exclusive (MESI) in a private per-core L1d or L2 (NINE) cache, the L3 tags still indicate which cores (might) have a copy of a line, so requests from one core for exclusive access to a line don't have to be broadcast to all cores, only to cores that might still have it cached. (i.e. it's a snoop filter for coherency traffic, which lets CPUs scale up to dozens of cores per chip without flooding each other with requests when they're not even sharing memory.)
i.e. L3 tags hold info about where a line is (or might be) cached in an L2 or L1 somewhere, so it knows where to send invalidation messages instead of broadcasting messages from every core to all other cores.
With Skylake-X (Skylake-server / SKX / SKL-SP), Intel dropped that and made L3 NINE and only a bit bigger than the total per-core L2 size. But there's still a snoop filter, it just doesn't have data. I don't know what Intel's planning to do for future (dual?)/quad/hex-core laptop / desktop chips (e.g. Cannonlake / Icelake). That's small enough that their classic ring bus would still be great, so they could keep doing that in mobile/desktop parts and only use a mesh in high-end / server parts, like they are in Skylake.
Realworldtech forum discussions of inclusive vs. exclusive vs. non-inclusive:
CPU architecture experts spend time discussing what makes for a good design on that forum. While searching for stuff about exclusive caches, I found this thread, where some disadvantages of strictly inclusive last-level caches are presented. e.g. they force private per-core L2 caches to be small (otherwise you waste too much space with duplication between L3 and L2).
Also, L2 caches filter requests to L3, so when its LRU algorithm needs to drop a line, the one it's seen least-recently can easily be one that stays permanently hot in L2 / L1 of a core. But when an inclusive L3 decides to drop a line, it has to evict it from all inner caches that have it, too!
David Kanter replied with an interesting list of advantages for inclusive outer caches. I think he's comparing to exclusive caches, rather than to NINE. e.g. his point about data sharing being easier only applies vs. exclusive caches, where I think he's suggesting that a strictly exclusive cache hierarchy might cause evictions when multiple cores want the same line even in a shared/read-only manner.
Given that CPUs are now multi-core and have their own L1/L2 caches, I was curious as to how the L3 cache is organized given that its shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages. This way when multiple threads run on the same core each thread may find their data in one of the sub-pages. I'm just coming up with this off the top of my head so I'm very interested in educating myself on what is really going on underneath the scenes. Can anyone share their insights or provide me with a link that will cure me of my ignorance?
Many thanks in advance.
There is single (sliced) L3 cache in single-socket chip, and several L2 caches (one per real physical core).
L3 cache caches data in segments of size of 64 bytes (cache lines), and there is special Cache coherence protocol between L3 and different L2/L1 (and between several chips in the NUMA/ccNUMA multi-socket systems too); it tracks which cache line is actual, which is shared between several caches, which is just modified (and should be invalidated from other caches). Some of protocols (cache line possible states and state translation): https://en.wikipedia.org/wiki/MESI_protocol, https://en.wikipedia.org/wiki/MESIF_protocol, https://en.wikipedia.org/wiki/MOESI_protocol
In older chips (epoch of Core 2) cache coherence was snooped on shared bus, now it is checked with help of directory.
In real life L3 is not just "single" but sliced into several slices, each of them having high-speed access ports. There is some method of selecting the slice based on physical address, which allow multicore system to do many accesses at every moment (each access will be directed by undocumented method to some slice; when two cores uses same physical address, their accesses will be served by same slice or by slices which will do cache coherence protocol checks).
Information about L3 cache slices was reversed in several papers:
https://cmaurice.fr/pdf/raid15_maurice.pdf Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters
https://eprint.iacr.org/2015/690.pdf Systematic Reverse Engineering of Cache Slice Selection in Intel Processors
https://arxiv.org/pdf/1508.03767.pdf Cracking Intel Sandy Bridge’s Cache Hash Function
With recent chips programmer has ability to partition the L3 cache between applications "Cache Allocation Technology" (v4 Family): https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models https://danluu.com/intel-cat/ https://lwn.net/Articles/659161/
Modern Intel L3 caches (since Nehalem) use a 64B line size, the same as L1/L2. They're shared, and inclusive.
See also http://www.realworldtech.com/nehalem/2/
Since SnB at least, each core has part of the L3, and they're on a ring bus. So in big Xeons, L3 size scales linearly with number of cores.
See also Which cache mapping technique is used in intel core i7 processor? where I wrote a much larger and more complete answer.
I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing for me.
Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One of the explanations I find would be TLB related: when a virtual address is not mapped in TLB, the system automatically skips searches in some cache levels.
Does this seem legitimate?
First, inclusive cache hierarchies may not be so common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, possibly Atoms - have an L1 that is included within the L2. (Nehalem and probably Sandybridge do, however, have both L1 and L2 included within L3; using Intel's current terminology, FLC and MLC in LLC.)
But, this doesn't necessarily matter. In most cache hierarchies if you have an L1 cache miss, then that miss will probably be looked up in the L2. Doesn't matter if it is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, you don't need to look. Although I have designed protocols and memory types that do this - e.g. a memory type that cached only in the L1 but not the L2, useful for stuff like graphics where you get the benefits of combining in the L1, but where you are repeatedly scanning over a large array, so caching in the L2 not a good idea. Bit I am not aware of anyone shipping them at the moment.
Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.
You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.
Most modern microprocessors do not stop at the first cache miss, but keep going trying to do extra work. This is overall often called speculative execution. Furthermore, the processor may be in-order or out-of-order, but although the latter may given you even greater differences between number of L1 misses and number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.
Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.
The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. Which then allocate a hardware data structure, called at Intel a fill buffer, at some other places a miss status handling register. Subsequent cache misses that are to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first will get sent to the L2, and counted as an L2 access.)
By the way, there may be a performance event for this: Squashed_Cache_Misses.
There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.
([*] By the way, when I say "probably" here I mean "On the machines that I helped design". Almost definitely. I might have to check the definition, look at the RTL, but I would be immensely surprised if not. It is almost guaranteed.)
E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...
If the address of A[0] is equal to zero modulo 64, then A[0]..A[63] will be in the same cache line, on a machine with 64 byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory access, 64 L1 cache misses, but only one L2 memory access.
(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 accesses per L2 access.)
Some more possibilities:
If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible) you may have a memory access pattern that is confusing a hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance evernt to count Prefetches_from_L2 or Prefetches_from_Memory.
Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
The write policy of a data cache determines whether a store hit writes its data only on that cache (write-back or copy-back) or also at the following level of the cache hierarchy (write-through).
Hence, a store that hits at a write-through L1-D cache, also writes its data at the L2 cache.
This could be another source of L2 accesses that do not come from L1 cache misses.
I have a few questions regarding Cache memories used in Multicore CPUs or Multiprocessor systems. (Although not directly related to programming, it has many repercussions while one writes software for multicore processors/multiprocessors systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?
Can one processor/core access each other's cache memory, because if they are allowed to access each other's cache, then I believe there might be lesser cache misses, in the scenario that if that particular processors cache does not have some data but some other second processors' cache might have it thus avoiding a read from memory into cache of first processor? Is this assumption valid and true?
Will there be any problems in allowing any processor to access other processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core,
Core two Duo etc..) does each cpu core/processor have its own cache
memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a L2 unified cache is typically shared between all cores. Or on 65nm Core2Quad (which was two core2duo dies in one package), each pair of cores had their own last-level cache and couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
Last-level cache is a a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically 1.5 to 2.25MB of L3 cache with every core, so a many-core Xeon might have a 36MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory heirarchy / system architecture, showing the per-core caches and their connection to shared L3, and DDR3 / DMI(chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs).
Can one processor/core access each other's cache memory, because if
they are allowed to access each other's cache, then I believe there
might be lesser cache misses, in the scenario that if that particular
processors cache does not have some data but some other second
processors' cache might have it thus avoiding a read from memory into
cache of first processor? Is this assumption valid and true?
No. Each CPU core's L1 caches tightly integrate into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be fast enough for very hot data, but can't be big enough for less-frequently used data that's still accessed regularly. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access other
processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
Quick answers
1) Yes 2)No, but it all may depend on what memory instance/resource you are referring, data may exist in several locations at the same time. 3)Yes.
For a full length explanation of the issue you should read the 9 part article "What every programmer should know about memory" by Ulrich Drepper ( http://lwn.net/Articles/250967/ ), you will get the full picture of the issues you seem to be inquiring about in a good and accessible detail.
To answer your first, I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache, and they share a second-level cache. This helps with both data synchronization and utilization of memory.
To answer your second question, I believe your assumption to be correct. If the processors were to be able to access each others' cache, there would obviously be less cache misses as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question, there could potentially be a problem with accessing other processors' cache memory, which goes to the "Single Write Multiple Read" principle. We can't allow more than one process to write to the same location in memory at the same time.
For more info on the core 2 duo, read this neat article.
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/
could L1/L2 cache line each cache multiple copies of the main memory data word?
It's possible that the main memory is in a cache more than once. Obviously that's true and a common occurrence for multiprocessor machines. But even on uni processor machines, it can happen.
Consider a Pentium CPU that has a split L1 instruction/data cache. Instructions only go to the I-cache, data only to the D-cache. Now if the OS allows self modifying code, the same memory could be loaded into both the I- and D-cache, once as data, once as instructions. Now you have that data twice in the L1 cache. Therefore a CPU with such a split cache architecture must employ a cache coherence protocol to avoid race conditions/corruption.
No - if it's already in the cache the MMU will use that rather than creating another copy.
Every cache basically stores some small subset of the whole memory. When CPU needs a word from memory it first goes to L1, then to L2 cache and so on, before the main memory is checked.
So a particular memory word can be in L2 and in L1 simultaneously, but it can't be stored two times in L1, because that is not necessary.
Yes it can. L1 copy is updated but has not been flushed to L2. This happens only if L1 and L2 are non-exclusive caches. This is obvious for uni-processors but it is even more so for multi-processors which typically have their own L1 caches for each core.
It all depends on the cache architecture - whether it guarantees any sort of thing.