I am learning about the access process of L1 cache of AMD processor. But I read AMD's manual repeatedly, and I still can't understand it.
My understanding of L1 data cache with Intel is:
L1 cache is virtually indexed and physically tagged (VIPT). The index bits of the virtual address select the cache set, and the physical tag then determines which cache line in that set, if any, holds the data.
(Intel makes their L1d caches associative enough and small enough that the index bits come only from the offset-within-page which is the same in the physical address. So they get the speed of VIPT with none of the aliasing problems, behaving like PIPT.)
But AMD used a new method. In Zen 1, they have a 32-Kbyte, 8-way set associative L1d cache, which (unlike the 64KB 4-way L1i) is small enough to avoid aliasing problems without micro-tags.
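The geometry argument above can be checked with a little arithmetic. This is only a sketch: `l1_geometry` is a made-up helper name, and it assumes 64-byte lines and 4 KB pages.

```python
def l1_geometry(size_bytes, ways, line_bytes=64, page_bytes=4096):
    """Return set count, index bit range, and whether the index fits in the page offset."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1      # bits used for the byte within a line
    index_bits = sets.bit_length() - 1             # bits used to select a set
    top_index_bit = offset_bits + index_bits - 1
    page_offset_bits = page_bytes.bit_length() - 1
    return {
        "sets": sets,
        "index_bit_range": (top_index_bit, offset_bits),
        # If every index bit lies inside the page offset, the virtual and
        # physical index are identical, so VIPT behaves like PIPT.
        "index_within_page_offset": top_index_bit < page_offset_bits,
    }

l1d = l1_geometry(32 * 1024, 8)   # Zen 1 L1d: 64 sets, index bits 11:6
l1i = l1_geometry(64 * 1024, 4)   # Zen 1 L1i: 256 sets, index bits 13:6
```

For the 32 KB 8-way cache the index stays inside the 12-bit page offset; for the 64 KB 4-way cache bits 13:12 come from the page number, which is where aliasing becomes possible.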
From AMD's 2017 Software Optimization Manual, section 2.6.2.2 "Microarchitecture of AMD Family 17h Processor" (Zen 1):
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially. Loads use this utag to determine which
way of the cache to read using their linear address, which is
available before the load's physical address has been determined via
the TLB. The utag is a hash of the load's linear address. This linear
address based lookup enables a very accurate prediction of in which
way the cacheline is located prior to a read of the cache data. This
allows a load to read just a single cache way, instead of all 8. This
saves power and reduces bank conflicts.
It is possible for the utag to
be wrong in both directions: it can predict hit when the access will
miss, and it can predict miss when the access could have hit. In
either case, a fill request to the L2 cache is initiated and the utag
is updated when L2 responds to the fill request.
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
It is also possible for two different
linear addresses that are NOT aliased to the same physical address to
conflict in the utag, if they have the same linear hash. At a given L1
DC index (11:6), only one cacheline with a given linear hash is
accessible at any time; any cachelines with matching linear hashes are
marked invalid in the utag and are not accessible.
It is possible for the utag to be wrong in both directions
What is the specific scenario of this sentence in the second paragraph? Under what circumstances will hit be predicted as miss and miss as hit?
When the CPU brings data from memory into the cache, does it calculate a cache way based on the utag and just put the line there, even if the other cache ways are empty?
Linear aliasing occurs when two different linear addresses are mapped to the same physical address.
How can different linear addresses map to the same physical address?
However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.
What does this sentence mean? My understanding is to first calculate the utag based on the linear address (virtual address) to determine which cache way to use. Then use the tag field of the physical address to determine whether it is a cache hit? How is utag updated? Will it be recorded in the cache?
any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.
What does this sentence mean?
How does AMD judge cache hit or miss? Why are some hits regarded as misses? Can someone explain? Many thanks!
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially.
Each cache line in the L1D has a utag associated with it. This implies the utag memory structure is organized exactly like the L1D (i.e., 8 ways and 64 sets) and there is a one-to-one correspondence between the entries. The utag is calculated based on the linear address of the request that caused the line to be filled in the L1D.
Loads use this utag to determine which way of the cache to read using
their linear address, which is available before the load's physical
address has been determined via the TLB.
The linear address of a load is sent simultaneously to the way predictor and the TLB (it's better to use the term MMU, since there are multiple TLBs). A particular set in the utag memory is selected using certain bits of the linear address (11:6) and all of the 8 utags in that set are read at the same time. Meanwhile, a utag is calculated based on the linear address of the load request. When both of these operations complete, the given utag is compared against all the utags stored in the set. The utag memory is maintained such that there can be at most one utag in each set with the same value. In case of a hit in the utag memory, the way predictor predicts that the target cache line is in the corresponding cache entry in the L1D. Up until this point, the physical address is not yet needed.
The utag is a hash of the load's linear address.
The hash function was reverse-engineered in the paper titled Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors in Section 3 for a number of microarchitectures. Basically, certain bits of the linear address at positions 27:12 are XOR'ed with each other to produce an 8-bit value, which is the utag. A good hash function should: (1) minimize the number of linear address pairs that map to the same utag, (2) minimize the size of the utag, and (3) have a latency not larger than the utag memory access latency.
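To make the idea concrete, here is a rough sketch of an XOR-based utag. The fold below is an illustrative stand-in I made up, not the bit pairing from the paper; the only things taken from the text are that bits 27:12 are combined by XOR into 8 bits and that the set index is bits 11:6.

```python
def utag_sketch(linear_addr):
    """Fold linear address bits 27:12 into 8 bits with XOR (hypothetical pairing)."""
    bits = (linear_addr >> 12) & 0xFFFF   # the 16 bits at positions 27:12
    return (bits & 0xFF) ^ (bits >> 8)    # XOR the low byte with the high byte

def set_index(linear_addr):
    """L1D set number: bits 11:6 of the linear address."""
    return (linear_addr >> 6) & 0x3F

a = 0x0000_1000
b = 0x0010_0000
same_utag = utag_sketch(a) == utag_sketch(b)
same_set = set_index(a) == set_index(b)
```

With this stand-in hash, `a` and `b` are different linear addresses that land in the same set with the same utag, which is exactly the kind of conflict the manual describes for non-aliased addresses.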
This linear address based lookup enables a very accurate prediction of
in which way the cacheline is located prior to a read of the cache
data. This allows a load to read just a single cache way, instead of
all 8. This saves power and reduces bank conflicts.
Besides the utag memory and associated logic, the L1D also includes a tag memory and a data memory, all have the same organization. The tag memory stores physical tags (bit 6 up to the highest bit of the physical address). The data memory stores cache lines. In case of a hit in the utag, the way predictor reads only one entry in the corresponding way in the tag memory and data memory. The size of a physical address is more than 35 bits on modern x86 processors, and so the size of a physical tag is more than 29 bits. This is more than 3x larger than the size of a utag. Without way prediction, in a cache with more than one cache way, multiple tags would have to be read and compared in parallel. In an 8-way cache, reading and comparing 1 tag consumes much less energy than reading and comparing 8 tags.
In a cache where each way can be activated separately, each cache entry has its own wordline, which is shorter than a wordline shared across multiple cache ways. Due to signal propagation delays, reading a single way takes less time than reading 8 ways. In a cache whose ways are accessed in parallel, on the other hand, there is no way prediction delay, but linear address translation ends up on the critical path of the load latency. With way prediction, the data from the predicted entry can be speculatively forwarded to dependent uops. This can provide a significant load latency advantage, especially since linear address translation latency can vary due to the multi-level design of the MMU, even in the typical case of an MMU hit. The downside is that it introduces a new reason why replays may occur: in case of a misprediction, tens or even hundreds of uops may need to be replayed. I don't know if AMD actually forwards the requested data before validating the prediction, but it's possible even though not mentioned in the manual.
Reduction of bank conflicts is another advantage of way prediction as mentioned in the manual. This implies that different ways are placed in different banks. Section 2.6.2.1 says that bits 5:2 of the address, the size of the access, and the cache way number determine the banks to be accessed. This suggests there are 16*8 = 128 banks, one bank for each 4-byte chunk in each way. Bits 5:2 are obtained from the linear address of the load, the size of the load is obtained from the load uop, and the way number is obtained from the way predictor. Section 2.6.2 says that the L1D supports two 16-byte loads and one 16-byte store in the same cycle. This suggests that each bank has a single 16-byte read-write port. Each of the 128 bank ports is connected through an interconnect to each of the 3 ports of the data memory of the L1D. One of the 3 ports is connected to the store buffer and the other two are connected to the load buffer, possibly with intermediary logic for efficiently handling cross-line loads (single load uop but two load requests whose results are merged), overlapping loads (to avoid bank conflicts), and loads that cross bank boundaries.
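Under the 128-bank reading above, bank selection could look roughly like this. It is a sketch of my inference, not AMD's documented layout: the way-major bank numbering and the name `banks_touched` are assumptions, and accesses that cross a cache line are left out.

```python
CHUNK_BYTES = 4        # one bank per 4-byte chunk (address bits 5:2)
CHUNKS_PER_LINE = 16   # 64-byte line / 4-byte chunks
WAYS = 8
TOTAL_BANKS = CHUNKS_PER_LINE * WAYS

def banks_touched(linear_addr, size, way):
    """Return the (assumed) bank numbers read by an access that stays within one line."""
    offset = linear_addr & 63
    if offset + size > 64:
        raise ValueError("cross-line accesses are split into two requests")
    first = offset // CHUNK_BYTES               # bits 5:2 of the first byte
    last = (offset + size - 1) // CHUNK_BYTES   # bits 5:2 of the last byte
    return [way * CHUNKS_PER_LINE + chunk for chunk in range(first, last + 1)]

aligned_16_byte_load = banks_touched(0x1000, 16, way=3)
unaligned_4_byte_load = banks_touched(0x1006, 4, way=0)
```

Two accesses conflict in this model when their bank lists overlap, which requires the same way and overlapping 4-byte chunks.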
The fact that way prediction requires accessing only a single way in the tag memory and the data memory of the L1D allows reducing or completely eliminating the need (depending on how snoops are handled) to make the tag and data memories truly multiported (which is the approach Intel has followed in Haswell), while still achieving about the same throughput. Bank conflicts can still occur, though, when there are simultaneous accesses to the same way and identical 5:2 address bits, but different utags. Way prediction does reduce bank conflicts because it doesn't require reading multiple entries (at least in the tag memory, but possibly also in the data memory) for each access, but it doesn't completely eliminate bank conflicts.
That said, the tag memory may require true multiporting to handle fill checks (see later), validation checks (see later), snooping, and "normal path" checks for non-load accesses. I think only load requests use the way predictor. Other types of requests are handled normally.
A highly accurate L1D hit/miss prediction can have other benefits too. If a load is predicted to miss in the L1D, the scheduler wakeup signal for dependent uops can be suppressed to avoid likely replays. In addition, the physical address, as soon as it's available, can be sent early to the L2 cache before fully resolving the prediction. I don't know if these optimizations are employed by AMD.
It is possible for the utag to be wrong in both directions: it can
predict hit when the access will miss, and it can predict miss when
the access could have hit. In either case, a fill request to the L2
cache is initiated and the utag is updated when L2 responds to the
fill request.
On an OS that supports multiple linear address spaces or allows synonyms in the same address space, cache lines can only be identified uniquely using physical addresses. As mentioned earlier, when looking up a utag in the utag memory, there can either be one hit or zero hits. Consider first the hit case. This linear address-based lookup results in a speculative hit and still needs to be verified. Even if paging is disabled, a utag is still not a unique substitute to a full address. As soon as the physical address is provided by the MMU, the prediction can be validated by comparing the physical tag from the predicted way with the tag from the physical address of the access. One of the following cases can occur:
1. The physical tags match and the speculative hit is deemed a true hit. Nothing needs to be done, except possibly triggering a prefetch or updating the replacement state of the line.
2. The physical tags don't match and the target line doesn't exist in any of the other entries of the same set. Note that the target line cannot possibly exist in other sets because all of the L1D memories use the same set indexing function. I'll discuss how this is handled later.
3. The physical tags don't match and the target line does exist in another entry of the same set (associated with a different utag). I'll discuss how this is handled later.
If no matching utag was found in the utag memory, there will be no physical tag to compare against because no way is predicted. One of the following cases can occur:
4. The target line actually doesn't exist in the L1D, so the speculative miss is a true miss. The line has to be fetched from somewhere else.
5. The target line actually exists in the same set but with a different utag. I'll discuss how this is handled later.
(I'm making two simplifications here. First, the load request is assumed to be to cacheable memory. Second, on a speculative or true hit in the L1D, there are no detected errors in the data. I'm trying to stay focused on Section 2.6.2.2.)
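The five outcomes can be written down as a small classifier. This is a software model of the outcomes only, not of the hardware sequence (the hardware does not scan all physical tags up front); `classify` and the tuple layout are names I made up for illustration.

```python
def classify(set_entries, load_utag, load_ptag):
    """Return the case number (1..5) for a load, given one L1D set.

    set_entries: one item per way, either None (vacant) or (utag, physical_tag).
    """
    # Way predictor: at most one entry in a set may hold a given utag.
    predicted_way = next(
        (w for w, e in enumerate(set_entries) if e is not None and e[0] == load_utag),
        None,
    )
    # Ground truth: where, if anywhere, the line really is.
    actual_way = next(
        (w for w, e in enumerate(set_entries) if e is not None and e[1] == load_ptag),
        None,
    )
    if predicted_way is not None:
        if actual_way == predicted_way:
            return 1                               # speculative hit confirmed
        return 2 if actual_way is None else 3      # wrong line behind the utag
    return 4 if actual_way is None else 5          # no utag match

entries = [(0x11, 0xA), (0x22, 0xB)] + [None] * 6  # 8 ways, two occupied
```

For example, `classify(entries, 0x33, 0xB)` gives case 5: the line with physical tag `0xB` is present, but it was filled under a different linear address, so the utag lookup misses.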
Accessing the L2 is needed only in cases 2 and 4 and not in cases 3 and 5, because in cases 3 and 5 the line is already present in the L1D. The only way to determine which is the case is by comparing the physical tag of the load with the physical tags of all present lines in the same set. This can be done either before or after accessing the L2. Either way, it has to be done to avoid the possibility of having multiple copies of the same line in the L1D. Doing the checks before accessing the L2 improves the latency in cases 3 and 5, but hurts it in cases 2 and 4. Doing the checks after accessing the L2 improves the latency in cases 2 and 4, but hurts it in cases 3 and 5. It's possible to both perform the checks and send a request to the L2 at the same time. But this may waste energy and L2 bandwidth in cases 3 and 5. It seems that AMD decided to do the checks after the line is fetched from the L2 (which is inclusive of the L1 caches).
When the line arrives from the L2, the L1D doesn't have to wait until it gets filled in it to respond with the requested data, so a higher fill latency is tolerable. The physical tags are now compared to determine which of the 4 cases has occurred. In case 4, the line is filled in the data memory, tag memory, and utag memory in the way chosen by the replacement policy. In case 2, the requested line replaces the existing line that happened to have the same utag and the replacement policy is not engaged to choose a way. This happens even if there was a vacant entry in the same set, essentially reducing the effective capacity of the cache. In case 5, the utag can simply be overwritten. Case 3 is a little complicated because it involves an entry with a matching physical tag and a different entry with a matching utag. One of them will have to be invalidated and the other will have to be replaced. A vacant entry can also exist in this case and go unused.
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
This is how case 5 (and case 2 to a lesser extent) can occur. Linear aliasing can occur within the same linear address space and across different address spaces (context switching and hyperthreading effects come into play).
It is also possible for two different linear addresses that are NOT
aliased to the same physical address to conflict in the utag, if they
have the same linear hash. At a given L1 DC index (11:6), only one
cacheline with a given linear hash is accessible at any time; any
cachelines with matching linear hashes are marked invalid in the utag
and are not accessible.
This is how cases 2 and 3 can occur and they're handled as discussed earlier. This part also tells us that the L1D uses the simple set indexing function: the set number is bits 11:6.
I think huge pages make cases 2 and 3 more likely to occur because more than half of the bits used by the utag hash function become part of the page offset rather than page number. Physical memory shared between multiple OS processes makes case 5 more likely.
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536): the explanation of how cache associativity impacts performance. In a test, the author used 3 square arrays of ints (1023x1023, 1024x1024, 1025x1025) and observed that accessing the first column was slower in the 1024x1024 case.
The author explained (background info: the CPU is an Intel with a 32KB, 8-way associative L1 cache):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
It strikes me as odd. After reading the wiki page, I would say the performance penalty comes from resolving address conflicts. Since each row can potentially be mapped into the same cache line, it is conflict after conflict, and the CPU has to resolve those, which takes time.
Thus my question: what is the real nature of the performance problem here? Is the accessible memory size of the cache lower, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column all words requested from RAM land in the same cell in cache (or the same 2 cells, if using a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row is finished, the next row starts. With a little bit of thought, you can see that consecutive cells in a column have addresses differing by 1024*N, where N=4 or 8 (or whatever the size of a cell is). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way, etc, etc.)
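You can see the effect by computing which cache set each element of the first column lands in. This is a sketch that assumes the 32KB, 8-way L1 from the question (so 64 sets of 64-byte lines), 4-byte ints, and an array that starts at a line-aligned address; `distinct_sets` is just a name for this illustration.

```python
LINE_BYTES = 64
SETS = 64   # 32 KB / (8 ways * 64-byte lines)

def distinct_sets(n, elem_bytes=4):
    """Count how many different cache sets the first column of an n x n array touches."""
    stride = n * elem_bytes   # distance in bytes between consecutive rows
    return len({(row * stride // LINE_BYTES) % SETS for row in range(n)})

for n in (1023, 1024, 1025):
    print(n, distinct_sets(n))
```

With N=1024 the stride is exactly 4096 bytes, so the bits that pick the set never change and every element of the column competes for the same 8 ways. With 1023 or 1025 the column spreads over all 64 sets.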
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 bytes and positioned at some multiple of that in the real address space. A "word" for caching depends on vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words probably can be found today. Again, the power-of-2 and positioned-at-multiple... are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
If we put the striped locks very close to each other in memory for a concurrent hashmap, the cache line size can affect performance because we would have to invalidate caches unnecessarily. If you add padding to the array of striped locks, it will improve performance.
Can someone explain this?
To start with a non-concurrent hashmap, the basic principle is this:
Have an indexed structure (most often an array or set of arrays) for the keys and values.
Get a hash for the key.
Reduce this to be within the size of the arrays. (Modulo does this simply enough, so if the hash value is 123439281 and there are 31 slots available, then we use 123439281 % 31 which is 9 and use that as our index).
See if there's a key there, and if so if it matches (equals).
Store the key if it's new, and the value.
The same approach can be used to find the value for a given key (or to find that there is none).
Of course the above doesn't work if there's a key in the same slot that isn't equal to the key you're concerned with, and there are different approaches to dealing with this, mainly either continuing to look in different slots until one is free, or having slots actually act as a linked list of equal-indexed pairs. I won't go into the details of this.
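As a rough illustration of those steps, here is a minimal non-concurrent map that handles collisions with a list per slot. `TinyHashMap` is a made-up name, and there is no resizing; it is a sketch, not how any particular library does it.

```python
class TinyHashMap:
    """Minimal hashmap: hash the key, reduce it modulo the slot count, chain collisions."""

    def __init__(self, slots=31):
        self.buckets = [[] for _ in range(slots)]

    def _index(self, key):
        # Reduce the hash so it falls within the array of slots.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # same key: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # new key: add it to the slot's chain

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

m = TinyHashMap()
m.put(9, "a")
m.put(40, "b")   # 40 % 31 is also 9, so this shares a slot with key 9
```

Both keys end up chained in slot 9, and each is still found by comparing for equality within that slot.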
If you are looking to other slots it won't work once you've filled the arrays (and will be slow before that point) and if you are using linked-lists to handle collisions you will be very slow if you have many keys at the same index (the O(1) you want becomes closer and closer to O(n) as this gets worse). Either way you're going to want a mechanism to resize the internal store when the amount stored gets too large.
Okay. That's a very high-level description of a hashmap. What if we want to make it threadsafe?
It won't be threadsafe by default: if, e.g., two different threads write different keys whose hashes modulo down to the same value, then one thread might stomp over the other.
The simplest way to make a hashmap threadsafe is to simply have a lock that we use on all operations. That means that every thread will wait on every other thread though, so it won't have very good concurrent behaviour. With a bit of clever structuring it's not too hard to have it that we can have multiple reading threads or a single writing thread (but not both), but that still isn't great.
It's possible to create hashmaps that are safely concurrent without using locks at all (see http://www.communicraft.com/blog/details/a-lock-free-dictionary for a description of one I wrote in C#, or http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable for one in Java by Cliff Click whose basic approach I used in my C# version).
But another approach is striped locks.
Because the basis of the map is either an array for key-value pairs (and likely a cached copy of the hashcode) or a set of arrays for them, and because it is generally safe to have two threads writing and/or reading to different parts of an array at a time (there are caveats, but I'll ignore them for now) the only problems are when either two threads want the same slot, or when a resize is necessary.
And therefore the different slots could have different locks, and then only threads that are operating on the same slot would need to wait on each other.
There'd still be the problem of resizing, but that isn't insurmountable; if you need to resize obtain every one of the locks (in a set order, so that you can prevent deadlocks from happening) and then do the resize, having first checked that no other thread did the same resize in the meantime.
However, if you've a hashmap with 10,000 slots this would mean 10,000 lock objects. That's a lot of memory used, and also a resize would mean obtaining every one of those 10,000 locks.
Striped locks are somewhere in-between the single-lock approach and the lock-per-slot approach. Have an array of a certain number of locks, say 16 as a nice (binary) round number. When you need to act on a slot then obtain lock number slotIndex % 16, and do your operation. Now while threads may still end up blocking on threads doing operations on completely different slots (slot 5 and slot 21 have the same lock) they can still act concurrently to many other operations, so it's a middle-ground between the two extremes.
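In code, the lock selection might look something like this. It is a sketch using Python's `threading.Lock`; the class and method names are my own, and a real map would wrap its slot operations in `with stripes.lock_for(slot):`.

```python
import threading

class StripedLocks:
    """A fixed array of locks shared by many slots (slot -> lock is slot % stripes)."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]

    def lock_for(self, slot_index):
        return self._locks[slot_index % len(self._locks)]

    def acquire_all(self):
        # Always taken in index order so two resizing threads cannot deadlock.
        for lock in self._locks:
            lock.acquire()

    def release_all(self):
        for lock in reversed(self._locks):
            lock.release()

stripes = StripedLocks()
shared = stripes.lock_for(5) is stripes.lock_for(21)      # same stripe
separate = stripes.lock_for(5) is not stripes.lock_for(6)  # different stripes
```

Slots 5 and 21 block each other because they map to the same lock, while slots 5 and 6 can be worked on concurrently.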
So that's how striped locking works, at a high level.
Now, modern day memory access is not uniform, in that it does not take the same time to access arbitrary pieces of memory because there is a level of caching (generally at least 2 levels) in the CPU. This caching has both good and bad effects.
Obviously the good effects normally outweigh the bad, or chip manufacturers wouldn't use it. If you access a piece of memory, and then access a piece of memory very close to it, the chances are that second access will be very fast because it will have been loaded into the cache on the first read. Writes are also improved.
It's already natural enough that a given piece of code is likely to want to do several operations on blocks of memory close to each other (e.g. reading two fields in the same object, or two locals in a method), which is why this sort of caching worked in the first place. And programmers further work to take advantage of this fact as much as possible in how they design their code, and collections such as hashmaps are a classic example. E.g. we might have stored keys and stored hashes in the same array so that reading one brings the other into the cache to be quickly read, and so on.
There are though times when this caching has a negative effect. In particular if two threads are going to deal with bits of memory that are close to each other at around the same time.
This doesn't come up that often, because threads are most often dealing with their own stacks or bits of heap memory pointed to by their own stacks, and only occasionally heap memory that is visible to other threads. That in itself is a big part of why CPU caches are normally a big win for performance.
However, the use of concurrent hashmaps is inherently a case where multiple threads hit neighbouring blocks of memory.
CPU caches work on the basis of "cache lines". These are blocks of memory that are loaded into the cache from the RAM, or written from the cache to the RAM, as a unit. (Again, while we're about to discuss a case where this is a bad thing, this is an efficient model most of the time).
Now, consider a 64-bit processor with 64-byte cache-lines. Every pointer or reference to an object is going to take up 8 bytes. If a piece of code tries to access such a reference it will mean that 64 bytes are loaded into the cache, then 8 bytes of that dealt with by the CPU. If the CPU writes to that memory, then those 8 bytes are changed on the cache, and the cache written back to the RAM. As said, this is generally good, because the odds are high that we'll also want to do the same with other bits of RAM nearby, and hence in the same cache line.
But what if another thread wants to hit the same block of memory?
If CPU0 goes to read from a value that is in the same cacheline that CPU1 has just written to, it will have a stale cacheline that has been invalidated and will have to read it again. If CPU0 was trying to write to it, it may well not only have to read it again, but also redo the operation that gave it the result to write.
Now, if that other thread had wanted to hit the exact same bit of memory, there'd have been a conflict even without caching, so things aren't that much worse than they would have been (but they are worse). But if the other thread was going to hit nearby memory it will still suffer.
This is obviously bad for our concurrent map's slots, but it's even worse for its striped locks. We'd said we might have 16 locks. With 64-byte cachelines and 64-bit references that's 2 cachelines for all the locks. The odds that a lock is in the same cacheline as the one wanted by the other thread are 50%. With 128-byte cachelines (Itanium has those) or 32-bit references (all 32-bit code uses those) it's 100%. With lots of threads it's effectively 100% that you're going to be waiting. And waiting again if there's yet another hit. And waiting.
Our attempt to prevent threads waiting on the same lock has turned into them waiting on the same cacheline.
Worse, the more cores you have using the locks, the worse this becomes. Each extra core slows down the total throughput roughly exponentially. 8 cores might take over 200 times as long to execute as 1 core would!
If however we pad out our striped locks with blank space so that there is a 56-byte gap between each one, then this doesn't happen; the locks are all on different cachelines, and operations on neighbouring locks don't affect it any more. This costs memory, and makes normal reading and writing slower (the point of caches is that it makes things faster most of the time after all), but is appropriate in cases where particularly frequent concurrent access is expected, and we're not likely to want to hit the next lock (we aren't, except for resize operations). (Another example would be striped counters; have different threads increment different integers and sum them when you want to get the tally).
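The arithmetic behind the padding can be sketched as follows. It only counts which cache line each lock reference would fall on, assuming the array starts on a line boundary; it does not control real memory layout, and `lines_used` is a made-up helper.

```python
def lines_used(num_locks, ref_bytes, line_bytes, pad_bytes=0):
    """Count the distinct cache lines occupied by an array of lock references."""
    stride = ref_bytes + pad_bytes   # distance between consecutive locks
    return len({(i * stride) // line_bytes for i in range(num_locks)})

unpadded_64bit = lines_used(16, ref_bytes=8, line_bytes=64)             # 2 lines
unpadded_32bit = lines_used(16, ref_bytes=4, line_bytes=64)             # 1 line
unpadded_128b_line = lines_used(16, ref_bytes=8, line_bytes=128)        # 1 line
padded = lines_used(16, ref_bytes=8, line_bytes=64, pad_bytes=56)       # 16 lines
```

An 8-byte reference plus 56 bytes of padding gives a 64-byte stride, so each of the 16 locks gets a cache line to itself.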
This problem of threads hitting neighbouring pieces of memory (called "false-sharing" because it has a performance impact caused by shared access to the same memory even though they are actually accessing neighbouring memory rather than the same memory) will also affect the internal storage of the hashmap itself, but not as much because the map itself is likely larger and so the odds of two accesses hitting the same cacheline is less. It would also be more expensive to use padding here for the same reason; being larger the amount of padding that would involve could be huge.
I have been reading about caching, mapping, cache misses, and the order in which cache blocks get replaced when all blocks are already full.
There is the least recently used algorithm, the FIFO algorithm, the least frequently used algorithm, random replacement, ...
But which algorithms are used in actual CPU caches? Or can all of them be used, with the operating system deciding which algorithm is best?
Edit: Even though I chose an answer, any further information is welcome ;)
As hivert said - it's hard to get a clear picture on the specific algorithm, but one can deduce some of the information according to hints or clever reverse engineering.
You didn't specify which CPU you mean, each one can have a different policy (actually even within the same CPU different cache levels may have different policies, not to mention TLBs and other associative arrays which also may have such policies). I did find a few hints about Intel (specifically Ivy bridge), so we'll use this as a benchmark for industry level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU related features here -
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentioned "Quad-Age LRU" - this is apparently an age based LRU that assigned some "age" to each line according to its predicted importance. They mention that prefetches get middle age, so demands are probably allocated at a higher age (or lower, whatever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that, but rather a complicated pseudo-LRU solution, so this might be an improvement.
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).
I guess this could answer, to some extent, your question on multiple LRU schemes. Implementing several schemes is probably hard and expensive in terms of HW, but when you have some policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling, etc.
The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to set an LRU for that pair, then an LRU bit for each pair of pairs of ways, etc. The 8-way L2 used pseudo-random replacement settable by privileged software (the OS) as using either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
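As a rough illustration of binary tree pLRU as described above (not the PowerPC 7450's actual logic), the sketch below keeps 7 bits for 8 ways. The class name `TreePLRU` and the bit convention (0 means the next victim is in the left half) are assumptions made for this example.

```python
class TreePLRU:
    def __init__(self, ways=8):
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        # One bit per internal tree node, stored in heap order.
        # 0 = next victim is in the left half, 1 = in the right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: make every node on the path point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way that would be replaced."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

After touching ways 0 to 7 in order, the victim is way 0, as with true LRU. Touching way 0 again then selects way 4 rather than way 1, which is where the "pseudo" comes from.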
The StrongARM SA-1110 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (Intel StrongARM SA-1110 Microprocessor Developer’s Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.)
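The difference between 2-way FIFO and 2-way LRU can be seen on a four-access trace. This is a minimal sketch of a single cache set; the `simulate` function and the block names are invented for the example.

```python
def simulate(policy, accesses, ways=2):
    """Return the final contents of one cache set, next victim first."""
    lines = []
    for block in accesses:
        if block in lines:
            if policy == "lru":
                # A hit refreshes recency under LRU; FIFO ignores hits.
                lines.remove(block)
                lines.append(block)
            continue
        if len(lines) == ways:
            lines.pop(0)          # evict the front of the queue
        lines.append(block)
    return lines

trace = ["A", "B", "A", "C"]
lru_result = simulate("lru", trace)
fifo_result = simulate("fifo", trace)
```

With the trace A, B, A, C, LRU evicts B (A was just reused) while FIFO evicts A (it was inserted first), so the two policies end up holding different blocks.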
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for L1 and L2 data caches, L2 instruction cache, and the L3 cache (L1 instruction cache is documented as using LRU.). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).
For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
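The MRU policy in the summary above is simple enough to sketch directly. The code below follows the description given there for one cache set; the class name `MRUSet`, the 4-way size and the all-ones initial state are assumptions, not measured behaviour.

```python
class MRUSet:
    def __init__(self, ways=4):
        assert ways >= 2
        self.tags = [None] * ways
        self.bits = [1] * ways        # assumed initial state: every way is a candidate

    def _mark(self, way):
        self.bits[way] = 0
        if 1 not in self.bits:        # that was the last 1-bit: set all others to 1
            self.bits = [1] * len(self.bits)
            self.bits[way] = 0

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is filled on a miss)."""
        if tag in self.tags:
            self._mark(self.tags.index(tag))
            return True
        way = self.bits.index(1)      # first block whose bit is 1 is replaced
        self.tags[way] = tag
        self._mark(way)
        return False
```

For example, after filling the set with a, b, c, d and hitting a again, a miss on e replaces b, the first way whose bit is still 1.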
This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How do I make code cache-effective/cache-friendly (more cache hits, as few cache misses as possible), from both perspectives: data cache and program cache (instruction cache)?
That is, what things in one's code, related to data structures and code constructs, should one take care of to make it cache-effective?
Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of a structure, etc., to make code cache-effective?
Are there any program constructs (if, for, switch, break, goto, ...) or code-flow patterns (for inside an if, if inside a for, etc.) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences related to making cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.
The variety will help in understanding it more deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).
Techniques for avoiding memory fetch latency are typically the first thing to consider, and sometimes go a long way. The limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to a cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of those fail to use 100% of the fetched cache lines before the cache lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelihood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
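The point about alignment holes in the list above can be observed from Python with `ctypes`, which lays out structures using the platform's native C rules. This is only an illustration; the struct and field names are invented, and the exact sizes depend on the platform's ABI.

```python
import ctypes

class Unsorted(ctypes.Structure):
    # A 1-byte field before a double usually forces padding before the double,
    # and the trailing 1-byte field forces padding at the end.
    _fields_ = [("flag", ctypes.c_uint8),
                ("value", ctypes.c_double),
                ("kind", ctypes.c_uint8)]

class Sorted(ctypes.Structure):
    # Same members, ordered by decreasing size.
    _fields_ = [("value", ctypes.c_double),
                ("flag", ctypes.c_uint8),
                ("kind", ctypes.c_uint8)]

unsorted_size = ctypes.sizeof(Unsorted)
sorted_size = ctypes.sizeof(Sorted)
```

On a typical 64-bit platform one would expect the reordered struct to be noticeably smaller, so more elements of an array of such structs fit in each cache line.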
We should also note that there are other ways to hide memory latency than using caches.
Modern CPUs often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hw prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
By regrouping instructions so that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory-level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
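To show what a tiled (blocked) loop nest looks like, here is a small sketch of a blocked matrix transpose. The function name and the tile size of 32 are arbitrary choices for the example, and no speedup should be expected in pure Python; the point is only the loop structure, which keeps each tile of source and destination in use before moving on.

```python
def transpose_tiled(a, n, tile=32):
    """Transpose an n x n matrix stored row-major in the flat list `a`."""
    out = [0] * (n * n)
    for ii in range(0, n, tile):              # walk over tiles
        for jj in range(0, n, tile):
            # Inside a tile, both the rows read and the rows written
            # come from a small, reusable region of memory.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

The result is identical to a plain double loop; only the order in which elements are visited changes, which is why the data dependencies mentioned above have to be checked before applying such a rewrite to other loops.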
These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
pseudocode
for (i = 0 to size)
    for (j = 0 to size)
        do something with ary[j][i]
The reason this is cache inefficient is that modern CPUs load a whole cache line of "near" memory addresses from main memory when you access a single memory address. We are iterating through the "j" (outer) rows of the array in the inner loop, so on each trip through the inner loop a different cache line has to be loaded, holding the addresses near the [j][i] entry, and for a large array it is likely to be evicted again before its neighbours are used. If this is changed to the equivalent:
for (i = 0 to size)
    for (j = 0 to size)
        do something with ary[i][j]
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accommodate this by remembering the most recently used chunks of data. It operates with cache lines, typically 64 bytes on current x86 and ARM CPUs (some designs use 32 or 128), so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually only one core may hold a given cache line in a modified (writable) state at a time. So if both cores constantly write to the same address, it will result in constant cache misses, as they're fighting over the line.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
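The shift and mask operations mentioned above look roughly like this. It is a minimal sketch of a bit-packed boolean array, not Java's `BitSet` implementation; the class name `BitArray` is made up for the example.

```python
class BitArray:
    def __init__(self, size):
        self.size = size
        self.data = bytearray((size + 7) // 8)   # 8 booleans per byte

    def get(self, i):
        # Byte index is i // 8, bit position within the byte is i % 8.
        return bool((self.data[i >> 3] >> (i & 7)) & 1)

    def set(self, i, value):
        if value:
            self.data[i >> 3] |= 1 << (i & 7)
        else:
            self.data[i >> 3] &= ~(1 << (i & 7)) & 0xFF
```

Each access costs a shift and a mask, but 20 flags occupy 3 bytes instead of 20, so far more of a large array fits in the cache.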
The most effective data structure for a cache is an array. Caches work best if your data structure is laid out sequentially, as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order thrashes the caches, because it always needs new cache lines to accommodate the randomly accessed memory. On the other hand, an algorithm which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
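A very small sketch of that layout, with the physics fields pulled out into parallel arrays and object IDs used as indices, might look as follows. `PhysicsData` and its fields are hypothetical; a real engine would use native arrays and SIMD rather than Python.

```python
from array import array

class PhysicsData:
    """Hypothetical structure-of-arrays store; an object ID is an index."""
    def __init__(self):
        self.pos = array("d")     # contiguous doubles, one per object
        self.vel = array("d")

    def add(self, pos, vel):
        self.pos.append(pos)
        self.vel.append(vel)
        return len(self.pos) - 1  # the new object's ID

    def step(self, dt):
        # The update loop touches only the hot data, in array order.
        for i in range(len(self.pos)):
            self.pos[i] += self.vel[i] * dt
```

Because game objects hold an ID rather than a pointer, the arrays can grow and be relocated without invalidating any references.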
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)
I wanted to check the time differences for two iteration orders ("outer" and "inner"), so I made a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
    for ( int x = 0; x < N; ++x )
        sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version ("x first") took 0.88sec and the faster one took 0.06sec. That's the power of caching :)
I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by itselves" does not hold.
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidate the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
Asking how to make code cache-effective or cache-friendly, like most of the other questions here, usually amounts to asking how to optimize a program, because the cache has such a huge impact on performance that any optimized program is one that is cache-friendly.
I suggest reading about optimization; there are some good answers on this site.
In terms of books, I recommend Computer Systems: A Programmer's Perspective, which has some fine text about the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There have been a lot of answers giving general advice on data structure selection, access patterns, etc. Here I would like to add another code design pattern, called a software pipeline, that makes use of active cache management.
The idea is borrowed from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern applies best to procedures that
can be broken down into a reasonable number of sub-steps, S[1], S[2], S[3], ... whose execution time is roughly comparable to RAM access time (~60-70ns).
take a batch of inputs and perform the aforementioned steps on them to get the result.
Let's take a simple case where there is only one sub-procedure.
Normally the code would look like:
def proc(input):
    return sub_step(input)
To get better performance, you might want to pass multiple inputs to the function in a batch, so that you amortize the function call overhead and also increase code cache locality.
def batch_proc(inputs):
    results = []
    for i in inputs:
        # avoids code cache misses, but still suffers data (inputs) misses
        results.append(sub_step(i))
    return results
However, as said earlier, if the execution time of the step is roughly the same as the RAM access time, you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
    results = []
    for i in range(0, len(inputs) - 1):
        # prefetch() stands in for a platform prefetch hint
        prefetch(inputs[i + 1])
        # work on the current item while [i+1] is flying back from RAM
        results.append(sub_step(inputs[i]))
    results.append(sub_step(inputs[-1]))
    return results
The execution flow would look like:
prefetch(1) ask CPU to prefetch input[1] into cache, where prefetch instruction takes P cycles itself and return, and in the background input[1] would arrive in cache after R cycles.
works_on(0) cold miss on 0 and works on it, which takes M
prefetch(2) issue another fetch
works_on(1) if P + R <= M, then inputs[1] should be in the cache already before this step, thus avoid a data cache miss
works_on(2) ...
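The timing argument in the flow above can be made concrete with a toy model. The sketch below is an assumption-laden simplification, not a measurement: a prefetch costs P and its data arrives R after the prefetch returns, each work step costs M, and an item that was not prefetched stalls for R before work starts. The function names are invented for the example.

```python
def unpipelined_time(n, R, M):
    """Every item misses: stall for R, then work for M."""
    return n * (R + M)

def pipelined_time(n, P, R, M):
    """Prefetch item i+1, then work on item i, as in the flow above."""
    t = 0
    ready = None                  # arrival time of the current item; None = not prefetched
    for i in range(n):
        next_ready = None
        if i + 1 < n:
            t += P                # issue the prefetch for item i+1
            next_ready = t + R    # its data arrives R later, in the background
        # A cold item stalls for R; a prefetched item waits only if its data is late.
        start = t + R if ready is None else max(t, ready)
        t = start + M
        ready = next_ready
    return t
```

With n = 4, P = 5, R = 60 and M = 70 (so P + R <= M), this model gives 355 time units for the pipelined version against 520 without prefetching: only the first item pays the full miss latency.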
There could be more steps involved; in that case you can design a multi-stage pipeline. As long as the timing of the steps matches the memory access latency, you should suffer few code/data cache misses. However, this process needs to be tuned with many experiments to find the right grouping of steps and the right prefetch timing. Because of the effort required, it sees more adoption in high-performance data/packet stream processing. A good production code example can be found in the DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information could be found:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
Besides aligning your structure and fields, if your structure is heap allocated you may want to use allocators that support aligned allocations, like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may have random false sharing. Remember that in Windows, the default heap has a 16-byte alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC: they can produce larger code. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.
Work with small chunks of data at a time. That is why less efficient sorting algorithms can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted into assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different code even though they are merely iterating through an array. In any case, your question is very architecture specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.