My Intel i3 processor has an L3 cache of 3072 KB. I want to partition (divide) the L3 cache based on the number of cores, which is 2 on the Intel i3 Clarkdale. If anyone has done something related to cache partitioning, please reply.
Consider Graviton3, for example. It's a 64-core CPU with per-core caches of 64 KiB L1d and 1 MiB L2, and a shared L3 of 64 MiB across all cores. The RAM bandwidth per socket is 307 GB/s (source).
In this plot (source), we see that all-cores bandwidth drops off to roughly half when the data exceeds 4MB. This makes sense: 64 x 64 KiB = 4 MiB is the total size of the L1 data caches.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of the 64 cores total 64 MiB, the same as the shared L3 size.
It looks from the plot like they may not have tested any sizes between 32M and 64M. Looks like a straight line between those points on all 3 CPUs.
Since 64M is the total size of both L2 and L3, I'd expect a test like this to have slowed most of the way down at 64M. As Brendan says, page tables and a bit of code will take space, competing with the actual intended test data. If the benchmark loop is tight, stack won't come into play, except for interrupt handling.
Once you're evicting anything from a working set slightly larger than cache, you often evict almost everything before getting back to it, depending on pseudo-LRU luck. I'd expect a test size of 48 or even 56 MiB to be a lot closer to the 32 MiB data point than the 64 MiB data point.
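For reference, a test like the one in the plot can be approximated by sweeping the working-set size and timing repeated reads. This is only a rough single-threaded sketch; the sizes, the repeat count and the one-line-of-traffic-per-touch accounting are my own assumptions, not the benchmark behind the plot:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Returns a monotonic timestamp in seconds. */
static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    for (size_t mib = 1; mib <= 128; mib *= 2) {
        size_t n = mib << 20;
        unsigned char *buf = malloc(n);
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)i;            /* touch every byte once (warm-up) */

        double t0 = now_s();
        volatile unsigned long sum = 0;           /* keep the reads from being optimized away */
        for (int rep = 0; rep < 16; rep++)
            for (size_t i = 0; i < n; i += 64)    /* one read per 64 B cache line */
                sum += buf[i];
        double secs = now_s() - t0;

        /* Report traffic as one full 64 B line per touched line, 16 passes. */
        printf("%4zu MiB: ~%.1f GB/s\n", mib, 16.0 * n / secs / 1e9);
        free(buf);
    }
    return 0;
}

A real all-cores test would run one such loop per core and sum the per-core bandwidths.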
Can all of L2/L3 cache be used by data?
In theory, yes; but only if there's no "non-data" (code) in the cache, only if you count "all data" (and don't just count a process's data while ignoring things like the stack and page tables), and only if there aren't any aliasing problems.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there?
For a fully associative cache I'd expect a sudden drop-off at/near 32 MiB. However, large caches are almost never fully associative, as it costs way too much to find anything in the cache.
As associativity decreases the chance of conflicts increases. For example, for an 8-way associative 64 MiB cache the pathological case is that everything conflicts and you're only able to effectively use 8 MiB of it.
More specifically, for a 64 MiB cache (with unknown associativity), and an "assumed Linux" environment that lacks support for cache coloring, it's reasonable to expect a smooth drop off that ends at 64 MiB.
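To make the conflict case concrete, here is a sketch using the 8-way / 64 MiB / 64 B-line geometry from the example above. Note that real L3s are physically indexed and often hash the set index, so this only illustrates the indexing arithmetic, not something you'd reliably observe:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_SIZE (64UL << 20)          /* 64 MiB, from the example above           */
#define WAYS       8UL
#define STRIDE     (CACHE_SIZE / WAYS)   /* 8 MiB: same-set stride (modulo indexing)  */

int main(void)
{
    size_t nlines = 16;                          /* more than 8 conflicting lines     */
    unsigned char *buf = calloc(nlines, STRIDE); /* 16 lines spaced 8 MiB apart       */
    if (!buf) return 1;

    /* With simple modulo indexing, all 16 lines map to the same set, so an
     * 8-way set keeps evicting them while the rest of the cache stays idle.
     * (With 4 KiB pages the physical addresses may not actually collide.)  */
    volatile unsigned long sum = 0;
    for (int rep = 0; rep < 1000; rep++)
        for (size_t i = 0; i < nlines; i++)
            sum += buf[i * STRIDE];

    printf("touched %zu conflicting lines, sum=%lu\n", nlines, sum);
    free(buf);
    return 0;
}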
Just to be clear: on a running Graviton 3 in AWS, lscpu gives me 32 MiB for L3, not 64 MiB.
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
The original question assumes an L3 of 64 MiB shared across all cores:
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of the 64 cores total 64 MiB, the same as the shared L3 size.
I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to have several 1G hugepages (36) by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were being passed on to them. On launching 8 VMs (each with two 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages.

That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced using the TLB, and if so, is it lower than the limit for regular-sized pages? How do I find this out? In this scenario I was using an Ivy Bridge system, and using the cpuid command I saw something like:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
Yes, of course. Not having an upper limit on the number of TLB entries would require an unbounded amount of physical space in the CPU die.
Every TLB in every architecture has an upper limit on the number of entries it can hold.
For the x86 case this number is less than what you probably expected: it is 4.
It was 4 in your Ivy Bridge and it is still 4 in my Kaby Lake, four generations later.
It's worth noting that 4 entries cover 4 GiB of RAM (4 x 1 GiB), which seems enough to handle networking if used properly.
Finally, TLBs are core resources, each core has its set of TLBs.
If you disable SMT (e.g. Intel Hyper Threading) or assign both threads on a core to the same VM, the VMs won't be competing for the TLB entries.
However, each VM can have at most 4 x C huge-page entries cached, where C is the number of cores dedicated to that VM.
The ability of the VM to fully exploit these entries depends on how the host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).
It's hard (almost impossible?) to use 1 GiB pages transparently; I'm not sure how the hypervisor and the VM are going to use those pages. I'd say you need specific support for that, but I'm not sure.
As Peter Cordes noted, 1GiB pages use a single-level TLB (and in Skylake, apparently there is also a second level TLB with 16 entries for 1GB pages).
A miss in the 1 GiB TLB will result in a page walk, so it's very important that all the software involved uses huge-page-aware code.
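For what it's worth, here is a minimal sketch of explicitly requesting a 1 GiB page on Linux with mmap. It assumes 1 GiB pages were reserved at boot (e.g. default_hugepagesz=1G hugepagesz=1G hugepages=N on the kernel command line); how your hypervisor and guest actually map their memory is a separate question:

/* Minimal sketch: mapping one 1 GiB huge page on Linux.
 * Assumes 1 GiB pages have already been reserved by the kernel. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1 GiB) = 30 */
#endif

int main(void)
{
    size_t len = 1UL << 30;                   /* one 1 GiB page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");  /* fails if no 1G pages reserved */
        return 1;
    }
    memset(p, 0, len);                        /* touch the page so it is really backed */
    munmap(p, len);
    return 0;
}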
Is it correct to say the following?
Statement 1: The Global Miss Rate for an L2 cache is the same as the Local Miss Rate for the L2 cache, since for a memory reference accessing L2, missing both L1 and L2 is equivalent to missing L2, because it has already missed L1 by virtue of attempting to access L2 at all (for a system of 2 hierarchical caches, L1 and L2).
Statement 2: The Global Miss Rate for an L1 cache is the same as Local Miss Rate for L1 Cache (For a system of 2 hierarchical caches - L1 and L2)
Statement 3: The Global Miss Rate for an Ln cache is the same as the Local Miss Rate for Ln cache (For a system of 2 hierarchical caches - L1, L2, L3,..., Ln)
Let me answer this as clearly as possible.
Local Miss Rate = number of misses in this cache / number of references to this cache
Global Miss Rate = number of misses in this cache / total number of references made by the processor
Statement 1: False
Explanation:
Number of accesses to L2 = number of misses in L1
Total number of references made by the processor = number of accesses to L1
(since all memory references made by the processor are first attempted in the L1 cache)
Therefore for L2 cache,
Local MR = Number of misses in L2 / Number of misses in L1
Global MR = Number of misses in L2 / Number of accesses to L1
So, for L2, Local MR != Global MR
Statement 2: True
Explanation:
For L1 cache,
Local MR = Number of misses in L1 / Number of accesses to L1
Global MR = Number of misses in L1 / Total number of references made by the processor
Total number of references made by the processor = Number of accesses to L1 (since all memory references made by the processor are first attempted in the L1 cache).
So, for L1, Local MR = Global MR
Statement 3: False. Correction - it should read "for a system of n hierarchical caches L1, L2, ..., Ln".
Explanation:
This statement is analogous to Statement 1: it concerns the last-level cache, i.e. L2 in a 2-level system, L3 in a 3-level system, L4 in a 4-level system, and so on.
We proved that it is false for L2 in 2-level system in Statement 1. Same explanation follows for the rest.
Thus, for Ln in n-level hierarchical system, Local MR != Global MR
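A small worked example with made-up counts (not from the question) shows the difference:

/* Made-up counts: 1000 processor references, 100 L1 misses, 20 L2 misses. */
#include <stdio.h>

int main(void)
{
    double cpu_refs  = 1000.0;  /* all references made by the processor */
    double l1_misses = 100.0;   /* also the number of accesses to L2    */
    double l2_misses = 20.0;

    double l1_local  = l1_misses / cpu_refs;   /* 0.10                              */
    double l1_global = l1_misses / cpu_refs;   /* 0.10 -> equal     (Statement 2)   */
    double l2_local  = l2_misses / l1_misses;  /* 0.20                              */
    double l2_global = l2_misses / cpu_refs;   /* 0.02 -> different (Statement 1)   */

    printf("L1: local %.2f, global %.2f\n", l1_local, l1_global);
    printf("L2: local %.2f, global %.2f\n", l2_local, l2_global);
    return 0;
}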
For typical x86 multicore processors, let us say we have a processor with 2 cores and both cores encounter an L1 instruction cache miss when reading an instruction. Let's also assume that both of the cores are accessing data in addresses which are in separate cache lines. Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
For typical x86 multicore processors, let us say, we have a processor with 2 cores
OK, let's use an early variant of the Intel Core 2 Duo with two cores (Conroe). It has 2 CPU cores, 2 L1i caches and a shared L2 cache.
and both cores encounter an L1 instruction cache miss when reading an instruction.
OK, there will be a miss in L1i when reading the next instruction (a miss in L1d, when you access data, works in a similar way, but there are only reads from L1i and both reads and writes from L1d). Each L1i miss will generate a request to the next layer of the memory hierarchy, the L2 cache.
Let's also assume that both of the cores are accessing data in addresses which are in separate cache lines.
Now we must know how the caches are organized (this is the classic middle-detail cache scheme, which is logically similar to real hardware). A cache is a memory array with special access circuits, and it looks like a 2D array. We have many sets (64 in this picture) and each set has several ways. When we ask the cache for data at some address, the address is split into 3 parts: tag, set index and offset inside the cache line. The set index is used to select the set (a row in our 2D cache memory array), then the tags in all ways are compared (to find the right column in the 2D array) with the tag part of the request address; this is done in parallel by 8 tag comparators. If there is a tag in the cache equal to the tag part of the request address, the cache has a "hit" and the cache line from the selected cell is returned to the requester.
Ways and sets; 2D array of cache (image from http://www.cnblogs.com/blockcipher/archive/2013/03/27/2985115.html or http://duartes.org/gustavo/blog/post/intel-cpu-caches/)
In the example image, set index 2 is selected and the parallel tag comparators give a "hit" (tag equality) for Way 1.
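To make the split concrete, here is a tiny sketch. The 64-set, 8-way organization matches the picture; the 64 B line size and the example address are just assumptions for illustration:

/* Address split into tag / set index / offset, assuming a 32 KiB, 8-way
 * cache with 64 B lines (64 sets). Geometry and address are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE   64u      /* bytes per cache line        */
#define NUM_SETS    64u      /* rows of the 2D cache array  */
#define OFFSET_BITS 6u       /* log2(LINE_SIZE)             */
#define SET_BITS    6u       /* log2(NUM_SETS)              */

int main(void)
{
    uint32_t addr   = 0xDEADBEEF;                              /* arbitrary example       */
    uint32_t offset = addr & (LINE_SIZE - 1);                  /* byte within the line    */
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* selects the row (set)   */
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);        /* compared in all 8 ways  */

    printf("addr 0x%08x -> tag 0x%05x, set %u, offset %u\n",
           addr, tag, set, offset);
    return 0;
}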
What is the "port" to some memory or to cache? This is hardware interface between external hardware blocks and the memory, which has lines for request address (set by external block, for L1 it is set by CPU, for L2 - by L1), access type (load or store; may be fixed for the port), data input (for stores) and data output with ready bit (set by memory; cache logic handles misses too, so it return data both on hit and on miss, but it will return data for miss later).
If we want to increase the true port count, we have to add hardware: for a raw SRAM memory array we need two extra transistors per bit to increase the port count by 1; for a cache we need to duplicate ALL the tag comparator logic. This costs too much, so there is not much truly multiported memory in a CPU, and where it exists, the number of true ports is small.
But we can emulate having several ports. From http://web.eecs.umich.edu/~twenisch/470_F07/lectures/15.pdf (EECS 470, 2007), slide 11:
Parallel cache access is harder than parallel FUs
fundamental difference: caches have state, FUs don’t
one port affects future for other ports
Several approaches used
true multi‐porting
multiple cache copies
virtual multi‐porting
multi‐banking (interleaving)
line buffers
Multi-banking (sometimes called slicing) is used by modern chips ("Intel Core i7 has four banks in L1 and eight banks in L2"; figure 1.6 on page 9 of ISBN 1598297546 (2011) - https://books.google.com/books?id=Uc9cAQAAQBAJ&pg=PA9&lpg=PA9). It means that there are several smaller hardware caches, and some bits of the request address (part of the set index - think of the sets, i.e. the rows, as split over 8 parts, or colored into interleaved rows) are used to select the bank. Each bank has a low number of ports (1) and functions just like a classic cache (there is a full set of tag comparators in each bank, but the height of the bank - the number of sets in it - is smaller, and every tag in the array is routed to only a single tag comparator - as cheap as in a single-ported cache).
Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?
If two accesses are routed to different L2 banks (slices), then the cache behaves as if it were multiported and can handle both requests at the same time. But if both are routed to the same bank with a single port, they will be serialized at the cache. Serialization may cost several ticks, and the request will be stalled near the port; the CPU will see this as slightly higher access latency.
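A rough sketch of the bank selection (the 8 banks come from the L2 example quoted above; which address bits a real chip uses to pick the bank is implementation-specific and often hashed):

/* Bank (slice) selection sketch: low set-index bits pick one of 8 banks.
 * 64 B lines and 8 banks are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6u                 /* 64 B cache line (assumption) */
#define NUM_BANKS   8u                 /* 8 L2 banks, as in the quote  */

static unsigned bank_of(uint64_t addr)
{
    return (addr >> OFFSET_BITS) & (NUM_BANKS - 1);
}

int main(void)
{
    uint64_t a = 0x1000, b = 0x1040;   /* two arbitrary line addresses */
    printf("0x%llx -> bank %u, 0x%llx -> bank %u: %s\n",
           (unsigned long long)a, bank_of(a),
           (unsigned long long)b, bank_of(b),
           bank_of(a) == bank_of(b) ? "serialized" : "handled in parallel");
    return 0;
}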
This is a question on my exam study guide and we have not yet covered how to calculate data transfer. Any help would be greatly appreciated.
Given is an 8-way set associative level-2 data cache with a capacity of 2 MByte (1 MByte = 2^20 Byte) and a block size of 128 bytes. The cache is connected to the main memory by a shared 32-bit address and data bus. The cache and the RISC CPU are connected by separate address and data buses, each with a width of 32 bits. The CPU is executing a load word instruction.
a) How much user data is transferred from the main memory to the cache in case of a cache miss?
b) How much user data is transferred from the cache to the CPU in case of a cache miss?
You first need to work out the cache geometry:
Number of cache blocks: 2 MB / 128 B = 16384 blocks
Number of sets: 16384 / 8 ways = 2048 sets (11 set-index bits)
Block offset bits: log2(128 B) = 7 bits
Address width: 32 bits, so tag bits: 32 - 11 - 7 = 14 bits
So the cache line size is 128 B - actually a line is a block, but it's good to know the above computation.
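If it helps, here is the same computation in code; the numbers are the ones given in the problem, and the ilog2 helper is just for illustration:

/* Recomputing the cache geometry from the problem's numbers. */
#include <stdio.h>

/* Integer log2 for powers of two. */
static unsigned ilog2(unsigned long x)
{
    unsigned b = 0;
    while (x >>= 1)
        b++;
    return b;
}

int main(void)
{
    unsigned long capacity   = 2UL << 20;  /* 2 MiB   */
    unsigned long block_size = 128;        /* bytes   */
    unsigned long ways       = 8;
    unsigned      addr_bits  = 32;

    unsigned long blocks      = capacity / block_size;            /* 16384 */
    unsigned long sets        = blocks / ways;                     /* 2048  */
    unsigned      offset_bits = ilog2(block_size);                 /* 7     */
    unsigned      set_bits    = ilog2(sets);                       /* 11    */
    unsigned      tag_bits    = addr_bits - set_bits - offset_bits;/* 14    */

    printf("blocks=%lu sets=%lu offset=%u set=%u tag=%u\n",
           blocks, sets, offset_bits, set_bits, tag_bits);
    return 0;
}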
a) How much user data is transferred from the main memory to the cache
in case of a cache miss?
In your problem, the L2 cache is the last-level cache before main memory. So if you miss in the L2 cache (you don't find the line you are looking for), you need to fetch the line from main memory, so 128 B of user data will be transferred from main memory. The fact that the address bus and data bus are shared does not change the amount transferred.
b) How much user data is transferred from the cache to the CPU in case
of a cache miss?
If you reached the L2 cache, that means you missed in the L1 cache. So a full L1 cache line has to be transferred from L2 to L1. Assuming the L1 line size is also 128 B, 128 B of data will go from L2 to L1. The CPU will then use only a fraction of that line (the 4-byte word requested by the load) to feed the instruction that generated the miss. Whether that line is also evicted from L2 depends on whether the cache is inclusive or exclusive, which should have been stated in the problem.