I have read this blog and I am still unsure about the importance of locality. Why is locality important for cache performance? Is it because it leads to fewer cache misses? Furthermore, how is a program written in order to achieve good locality and hence good cache performance?
Caches are smaller, and usually much smaller, than the main memories they are associated with. For example, on x86 chips, the L1 cache is typically 32 KiB, while memory sizes of 32 GiB or larger are common, which is more than a million times larger.
Without locality, memory requests would be uniformly distributed over the application's memory, and given the very large ratio between memory size and cache size, the chance of any request hitting in the cache would be microscopic (about one in a million for the example above). With a microscopic hit rate, the cache would be useless.
Related
Let's assume an algorithm is repeatedly processing buffers of data. It may be accessing, say, 2 to 16 of these buffers, all of the same size. What would you expect to be the optimum size of these buffers, assuming the algorithm can process the full data in smaller blocks?
I expect cache misses to become a potential bottleneck if the blocks are too big, but of course the bigger the blocks, the better for vectorization.
Let's assume current i7/i9 CPUs (2018).
Any ideas?
Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
Modern Intel CPUs have 32 KiB L1d and 256 KiB L2, private per core. (Or Skylake-AVX512 has 1 MiB private L2 caches, with less shared L3.) (Which cache mapping technique is used in intel core i7 processor?)
Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.
Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and spending more time outside of the main loop.
Also remember that with Hyperthreading, each logical core competes for cache on the physical core it's running on. So if two threads end up on the same physical core, but are touching totally different memory, your effective cache sizes are about half.
Probably you should make the buffer size a tunable parameter, and profile with a few different sizes.
Use perf counters to check if you're actually avoiding L1d or L2 misses or not, with different sizes, to help you understand whether your code is sensitive to different amounts of memory latency or not.
To use cache memory, main memory is divided into cache lines, typically 32 or 64 bytes long. An entire cache line is cached at once. What is the advantage of caching an entire line instead of a single byte or word at a time?
This is done to exploit the principle of locality; spatial locality to be precise. This principle states that the data bytes which lie close together in memory are likely to be referenced together in a program. This is immediately apparent when accessing large arrays in loops. However, this is not always true (e.g. pointer based memory access) and hence it is not advisable to fetch data from memory at more than the granularity of cache lines (in case the program does not have locality of reference) since cache is a very limited and important resource.
Having the cache block size equal to the smallest addressable size would mean that, if larger accesses are supported, multiple tags would have to be checked for each larger access. While parallel tag checking is often used for set-associative caches, a four-fold increase (8-bit compared to 32-bit) in the number of tags to check would increase access latency and greatly increase energy cost. In addition, this introduces the possibility of partial hits for larger accesses, increasing the complexity of sending the data to a dependent operation or internal storage. While data can be speculatively sent by assuming a full hit (so latency need not be hurt by the possibility of partial hits), the complexity budget is better not spent on supporting partial hits.
32-bit cache blocks, when the largest access size is 32 bits, would avoid the above-mentioned issues, but would use a significant fraction of storage for tags. E.g., a 16KiB direct-mapped cache in a 32-bit address space would use 18 bits for the address portion of the tag; even without additional metadata such as coherence state, tags would use 36% of the storage. (Additional metadata might be avoided by having a 16KiB region of the address space be non-cacheable; a tag matching this address region would be interpreted as "invalid".)
Besides the storage overhead, having more tag data tends to increase latency (smaller tag storage facilitates earlier way selection) and access energy. In addition, having a smaller number of blocks for a cache of a given size makes way prediction and memoization easier, these are used to reduce latency and/or access energy.
(The storage overhead can be a significant factor when it allows tags to be on chip while data is too large to fit on chip. If data uses a denser storage type than tags — e.g., data in DRAM and tags in SRAM with a four-fold difference in storage density —, lower tag overhead becomes more significant.)
If caches only exploited temporal locality (the reuse of a memory location within a "short" period of time), this would typically be the most attractive block size. However, spatial locality of access (accesses to locations near an earlier access often being close in time) is common. Taken control-flow instructions are typically less than a sixth of all instructions, and many branches and jumps are short (so the branch/jump target is somewhat likely to be within the same cache block as the branch/jump instruction if each cache block holds four or more instructions). Stack frames are local to a function (concentrating the timing of accesses, especially for leaf functions, which are common). Array accesses often use unit stride or very small strides. Members of a structure/object tend to be accessed nearby in time (conceptually related data tends to be related in action/purpose and so accessed nearer in time). Even some memory allocation patterns bias access toward spatial locality: related structures/objects are often allocated nearby in time, and if the preferred free memory is not fragmented (as when spatially local allocations are freed nearby in time, when little memory has been freed, or when the allocator is clever about reducing fragmentation), such allocations are more likely to be spatially local.
With multiple caches, coherence overhead also tends to be reduced with larger cache blocks (under the assumption of spatial locality). False sharing increases coherence overhead (similar to a lack of spatial locality increasing capacity and conflict misses).
In this sense, larger cache blocks can be viewed as a simple form of prefetching (even with respect to coherence). Prefetching trades bandwidth and cache capacity for a reduction in latency via cache hits (as well as from increasing the useful queue size and scheduling flexibility). One could gain the same benefit by always prefetching a chunk of memory into multiple small cache blocks, but the capacity benefit of finer-grained eviction would be modest because spatial locality of use is common. In addition, to avoid prefetching data that is already in the cache, the tags for the other blocks would have to be probed to check for hits.
With simple modulo-power-of-two indexing and modest associativity, two spatially nearby blocks are more likely to conflict with, and evict early, another block with spatial locality (index A and index B have the same spatial-locality relationship for all addresses mapping to indexes within a larger address range). With LRU-oriented replacement, accesses within a larger cache block reduce the chance of a too-early eviction when spatial locality is common, at the cost of some capacity and conflict misses.
(For a direct-mapped cache, there is no difference between always prefetching a multi-block aligned chunk and using a larger cache block, so paying the extra tag overhead would be pointless.)
Prefetching into a smaller buffer would avoid cache pollution from used data, increasing the benefit of smaller block size, but such also reduces the temporal scope of the spatial locality. A four-entry prefetch buffer would only support spatial locality within four cache misses; this would catch most stream-like accesses (rarely will more than four new "streams" be active at the same time) and many other cases of spatial locality but some spatial locality is over a larger span of time.
Mandatory prefetching (whether from larger cache blocks or a more flexible mechanism) provides significant bandwidth advantages. First, the address and request type overhead is spread over a larger amount of data. 32 bits of address and request type overhead per 32 bit access uses 50% of the bandwidth for non-data but less than 12% when 256 bits of data are transferred.
Second, the memory controller processing and scheduling overhead can be more easily averaged over more transferred data.
Finally, DRAM chips can provide greater bandwidth by exploiting internal prefetch. Even in the days of Fast Page Mode DRAM, accesses within the same DRAM page were faster and higher bandwidth (less page precharge and activation overhead); while non-mandatory prefetch could exploit such behavior and be more general, the control and communication overheads would be larger. Modern DRAMs have minimum burst lengths (and burst chop merely drops part of the DRAM-chip-internal prefetch — the internal access energy and array occupation are not reduced).
The ideal cache block size depends on workload ('natural' algorithm choices and legacy optimization assumptions, data set sizes and complexity, etc.), cache sizes and associativity (larger and more associative caches encourage larger blocks), available bandwidth, use of in-cache data compression (which tends to encourage larger blocks), cache block sectoring (where validity/coherence state is tracked at finer granularity than the address), and other factors.
The main advantage of caching an entire line is that the probability of a hit on the next memory access is increased.
From Tanenbaum's "Modern Operating Systems" book:
Cache-hit: When the program needs to read a memory word, the cache hardware checks to see if the line needed is in the cache.
If we don't have a cache hit, a cache miss occurs and a memory request is sent to the main memory.
As a result, more time is spent completing the access, since reaching into main memory is far more costly than hitting in the cache.
So caching an entire line increases the probability of completing the next request within two clock cycles (a cache hit).
Note: I'm not sure if StackOverflow is the correct place for that question or if there is a more suitable StackExchange sub for this
I've read in a book that, for multilevel CPU caches, the cache line size increases with each level's total memory size. I can totally understand how this works (or at least I think so) in quite simple architectures. Then I came across this question. The question is: how can cache memories with the same cache line size cooperate?
This is how I perceive the way cache memories with different cache line sizes work. For simplicity, let's suppose there are no separate caches for data and for instructions, and we only have L1 and L2 caches (L3 and L4 don't exist).
If L1 has a cache line size of 64 bytes and L2 of 128 bytes, when we have a cache miss on L2 and we need to fetch the desired byte or word from main memory, we also bring in its closest bytes or words in order to fill the 128 bytes of the L2 cache line. Then, because of the locality of the references to memory locations produced by the processor, we have a higher probability of getting a hit on L2 when missing on L1. But if we had equal cache line sizes, this of course wouldn't happen with the previous algorithm. Can you explain some sort of simple algorithm or implementation of how modern CPUs take advantage of caches having the same line size?
Thanks in advance.
I've read in a book, that for multilevel CPU caches, cache line size increases as per level's total memory size.
That's not true for most CPUs. Usually the line size is the same in all caches, but the total size increases. Often also the associativity, but usually not by as much as the total size, so the number of sets typically increases.
The point of multi-level caches is to get low latency and large size without needing a single cache that's both large and low latency (because that's physically impossible).
HW prefetch into L2 and/or L1 is what makes sequential reads work well, not larger line size in outer levels of cache. (And in multi-core CPUs, private L1/L2 + shared L3 provide private latency + bandwidth filters for the part of the memory workload that hits the shared domain, and then you have L3 as a coherency backstop instead of hitting DRAM for data that's shared between cores.)
Having different line sizes in different caches is more complicated, especially in a multi-core system where caches have to maintain coherency with each other using MESI. Transferring whole cache lines around between caches works well.
But if L1D lines are 64B and private L2 / shared L3 lines are 128B, then a load on one core might force the L2 cache to request both halves separately in case separate cores had each of the two halves of the 128B line modified. Sounds really complicated, and puts more logic into the outer-level cache.
(Paul Clayton's answer on the question you linked points out that a possible solution to that problem is separate validity bits for the two halves of a larger cache line, or even separate MESI coherency state. But still sharing the same tag, so if they are both valid then they have to be caching two halves of the same 128B block.)
Does anyone know why most of today's processors have several layers of caches, like L1, L2 and L3? Why can't a processor do with one big L1 cache?
Doesn't having multiple layers of cache increase the complexity of caching protocols?
Die size. L1 is usually on-die; there is no room for a large cache on-die. L2/L3 can get its own die and can be bigger and processed differently.
Also speed: L1 is built with tradeoffs for maximum speed, while L2/L3 doesn't have to be optimized as aggressively for speed.
Also multi-core. Modern multi-core processors give each core its own L1 for speed, but they share some or all of the other caches for coherency.
That said, PA-RISC processors have been built with the "let's just make a big L1 cache" approach. They were expensive.
Why cant a processor do with one big L1 cache?
The larger your processor cache, the longer the latency. There are also practical and cost considerations, since larger caches occupy more physical space on a chip. After a certain size, you lose too much of the caching speedup to make it worth it to increase cache size further. Eventually, therefore, a large cache becomes undesirable.
Processor designs that still want a large cache can make a tradeoff by having multiple cache levels. You start with a small and fast cache, and gradually fall back to larger, slower caches on successive misses.
Because in today's architectures you have more than one CPU/core accessing the memory. The L3 cache is a cache of caches that is shared between all the CPUs/cores. This reduces the amount of data that needs to go through the memory bus, which is usually a good idea. If you want, you can have a look at https://imgur.com/gallery/aBKD0Fv which shows how the layers are organized and how they evolved over time.
I am having trouble understanding locality of reference. Can anyone please help me understand what it means and what the following are:
Spatial Locality of reference
Temporal Locality of reference
This would not matter if your computer was filled with super-fast memory.
But unfortunately that's not the case and computer-memory looks something like this1:
+----------+
| CPU | <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache| <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 - Cache| <<-- ~12 CPU-cycles access latency (fast)
+----+-----+
|
+----------+
|L3 - Cache| <<-- ~35 CPU-cycles access latency (medium)
+----+-----+ (usually shared between CPU-cores)
|
| <<-- This thin wire is the memory bus, it has limited bandwidth.
+----+-----+
| main-mem | <<-- ~100 CPU-cycles access latency (slow)
+----+-----+ <<-- The main memory is big but slow (because we are cheap-skates)
|
| <<-- Even slower wire to the harddisk
+----+-----+
| harddisk | <<-- Works at 0.001% of CPU speed
+----------+
Spatial Locality
In this diagram, the closer data is to the CPU the faster the CPU can get at it.
This is related to spatial locality. Data has spatial locality if it is located close together in memory.
Because of the cheap-skates that we are, RAM is not really Random Access Memory; it is really "Slow if Random, Less Slow If Accessed Sequentially" Memory (SIRLSIAS-AM). DDR SDRAM transfers a whole burst of 32 or 64 bytes for one read or write command.
That is why it is smart to keep related data close together, so you can do a sequential read of a bunch of data and save time.
Temporal locality
Data stays in main-memory, but it cannot stay in the cache, or the cache would stop being useful. Only the most recently used data can be found in the cache; old data gets pushed out.
This is related to temporal locality. Data has strong temporal locality if accesses to it are close together in time.
This is important because if item A is in the cache (good), then item B (with strong temporal locality to A) is very likely to also be in the cache.
Footnote 1:
This is a simplification, with latency cycle counts estimated from various CPUs for example purposes, but it gives you the right order-of-magnitude idea for typical CPUs.
In reality, latency and bandwidth are separate factors, with latency harder to improve for memory farther from the CPU. But HW prefetching and/or out-of-order exec can hide latency in some cases, like looping over an array. With unpredictable access patterns, effective memory throughput can be much lower than 10% of L1d cache bandwidth.
For example, L2 cache bandwidth is not necessarily 3x worse than L1d bandwidth. (But it is lower if you're using AVX SIMD to do 2x 32-byte loads per clock cycle from L1d on a Haswell or Zen2 CPU.)
This simplified version also leaves out TLB effects (page-granularity locality) and DRAM-page locality. (Not the same thing as virtual memory pages). For a much deeper dive into memory hardware and tuning software for it, see What Every Programmer Should Know About Memory?
Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why a multi-level cache hierarchy is necessary to get the combination of latency/bandwidth and capacity (and hit-rate) we want.
One huge fast L1-data cache would be prohibitively power-expensive, and still not even possible with as low latency as the small fast L1d cache in modern high-performance CPUs.
In multi-core CPUs, L1i/L1d and L2 cache are typically per-core private caches, with a shared L3 cache. Different cores have to compete with each other for L3 and memory bandwidth, but each has its own L1 and L2 bandwidth. See How can cache be that fast? for a benchmark result from a dual-core 3GHz IvyBridge CPU: aggregate L1d cache read bandwidth on both cores of 186 GB/s vs. 9.6 GB/s DRAM read bandwidth with both cores active. (So memory = 10% of L1d for single-core is a good bandwidth estimate for desktop CPUs of that generation, with only 128-bit SIMD load/store data paths.) And L1d latency of 1.4 ns vs. DRAM latency of 72 ns.
It is a principle which states that if some variables are referenced
by a program, it is highly likely that the same might be referenced
again (later in time - also known as temporal locality) .
It is also highly likely that memory locations adjacent to a referenced
location will be referenced soon (spatial locality).
First of all, note that these concepts are not universal laws, they are observations about common forms of code behavior that allow CPU designers to optimize their system to perform better over most of the programs. At the same time, these are properties that programmers seek to adopt in their programs as they know that's how memory systems are built and that's what CPU designers optimize for.
Spatial locality refers to the property of some (most, actually) applications to access memory in a sequential or strided manner. This usually stems from the fact that the most basic data structure building blocks are arrays and structs, both of which store multiple elements adjacently in memory. In fact, many implementations of data structures that are semantically linked (graphs, trees, skip lists) are using arrays internally to improve performance.
Spatial locality allows a CPU to improve the memory access performance thanks to:
Memory caching granularities such as cache lines, page-table pages, and memory-controller pages are already larger by design than what is needed for a single access. This means that once you pay the memory penalty for bringing data in from far memory or a lower-level cache, the more additional data you can consume from it, the better your utilization.
Hardware prefetching, which exists on almost all CPUs today, often covers spatial accesses. Every time you fetch addr X, the prefetcher will likely fetch the next cache line, and possibly others further ahead. If the program exhibits a constant stride, most CPUs would be able to detect that as well and extrapolate to prefetch even further steps of the same stride. Modern spatial prefetchers may even predict variable recurring strides (e.g. VLDP, SPP).
Temporal locality refers to the property of memory accesses or access patterns to repeat themselves. In the most basic form this could mean that if address X was once accessed it may also be accessed in the future, but since caches already store recent data for a certain duration this form is less interesting (although there are mechanisms on some CPUs aimed to predict which lines are likely to be accessed again soon and which are not).
A more interesting form of temporal locality is that two (or more) temporally adjacent accesses observed once, may repeat together again. That is - if you once accessed address A and soon after that address B, and at some later point the CPU detects another access to address A - it may predict that you will likely access B again soon, and proceed to prefetch it in advance.
Prefetchers aimed to extract and predict this type of relations (temporal prefetchers) are often using relatively large storage to record many such relations. (See Markov prefetching, and more recently ISB, STMS, Domino, etc..)
By the way, these concepts are in no way exclusive, and a program can exhibit both types of localities (as well as other, more irregular forms). Sometimes both are even grouped together under the term spatio-temporal locality to represent the "common" forms of locality, or a combined form where the temporal correlation connects spatial constructs (like address delta always following another address delta).
Temporal locality of reference - A memory location that has been used recently is more likely to be accessed again. For example, variables in a loop: the same set of variables (symbolic names for memory locations) is used across many iterations of the loop.
Spatial locality of reference - A memory location that is close to the currently accessed memory location is more likely to be accessed. For example, if you declare int a, b; float c, d; the compiler is likely to assign them consecutive memory locations. So if a is being used, it is very likely that b, c or d will be used in the near future. This is one way cache lines of 32 or 64 bytes help: they are not 4 or 8 bytes in size (the typical sizes of int, float, long and double variables).