How does cache associativity impact performance? [duplicate]

This question already has answers here:
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p. 536): the explanation of how cache associativity impacts performance. In a test, the author used three square arrays of ints, 1023x1023, 1024x1024, and 1025x1025, and observed that accessing the first column was slower in the 1024x1024 case.
The author explained (background: the CPU is an Intel with a 32 KB, 8-way set associative L1 cache):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to the full cache.
It strikes me as odd; after reading the Wikipedia page, I would say the performance penalty comes from resolving address conflicts. Since each row can potentially be mapped to the same cache line, it is conflict after conflict, and the CPU has to resolve those, which takes time.
Thus my question: what is the real nature of the performance problem here? Is the accessible cache size effectively smaller, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
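For reference, a minimal C sketch of the kind of benchmark the book describes. The array sizes and the first-column walk are from the question; the repetition count and timing scaffolding are my own illustrative additions:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk down the first column of an N x N int matrix.
       For N = 1024 every element of the column is 4096 bytes apart,
       which is the critical stride for a 32 KB, 8-way, 64-byte-line L1. */
    static int64_t sum_first_column(const int *m, size_t n)
    {
        int64_t sum = 0;
        for (size_t row = 0; row < n; row++)
            sum += m[row * n];      /* stride of n * sizeof(int) bytes */
        return sum;
    }

    int main(void)
    {
        const size_t sizes[] = { 1023, 1024, 1025 };
        for (size_t i = 0; i < 3; i++) {
            size_t n = sizes[i];
            int *m = calloc(n * n, sizeof *m);
            if (!m) return 1;

            clock_t t0 = clock();
            int64_t s = 0;
            for (int rep = 0; rep < 10000; rep++)
                s += sum_first_column(m, n);
            clock_t t1 = clock();

            printf("N=%zu sum=%lld time=%.3fs\n", n, (long long)s,
                   (double)(t1 - t0) / CLOCKS_PER_SEC);
            free(m);
        }
        return 0;
    }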

Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column, all words requested from RAM land in the same cache slot (or the same 2 slots with a 2-way associative cache, and so on).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row ends, the next row starts. With a little thought, you can see that consecutive cells in a column have addresses differing by 1024*S, where S is the size of a cell (4 or 8 bytes, say). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way associativity, etc, etc.)
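To make "takes several bits out of the address" concrete, here is a small sketch using the geometry from the question (32 KB, 8-way, 64-byte lines, hence 64 sets); the function names are mine:

    #include <stdint.h>
    #include <stdio.h>

    /* 32 KB, 8-way, 64-byte lines  =>  32768 / (8 * 64) = 64 sets.
       The set index is simply bits 11:6 of the address. */
    #define LINE_SIZE   64u
    #define NUM_SETS    64u

    static unsigned set_index(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int main(void)
    {
        /* Column elements of a 1024 x 1024 int array are 4096 bytes apart.
           4096 is a multiple of 64 * 64, so every element maps to the same set. */
        for (uintptr_t addr = 0; addr < 8 * 4096; addr += 4096)
            printf("offset %6lu -> set %u\n",
                   (unsigned long)addr, set_index(addr));
        return 0;
    }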
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to the full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 bytes and positioned at some multiple of that in the real address space. A "word" for caching depends on the vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words can probably be found today. Again, the power-of-2 and positioned-at-multiple... are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.

Related

How is AMD's micro-tagged L1 data cache accessed?

I am learning about how the L1 cache of AMD processors is accessed, but even after reading AMD's manual repeatedly, I still can't understand it.
My understanding of Intel's L1 data cache is:
The L1 cache is virtually indexed and physically tagged. Therefore, the index bits of the virtual address are used to find the corresponding cache set, and the tag then determines which cache line within that set matches.
(Intel makes their L1d caches associative enough and small enough that the index bits come only from the offset-within-page which is the same in the physical address. So they get the speed of VIPT with none of the aliasing problems, behaving like PIPT.)
But AMD used a new method. In Zen 1, they have a 32-Kbyte, 8-way set associative L1d cache, which (unlike the 64KB 4-way L1i) is small enough to avoid aliasing problems without micro-tags.
From AMD's 2017 Software Optimization Manual, section 2.6.2.2 "Microarchitecture of AMD Family 17h Processor" (Zen 1):
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially. Loads use this utag to determine which
way of the cache to read using their linear address, which is
available before the load's physical address has been determined via
the TLB. The utag is a hash of the load's linear address. This linear
address based lookup enables a very accurate prediction of in which
way the cacheline is located prior to a read of the cache data. This
allows a load to read just a single cache way, instead of all 8. This
saves power and reduces bank conflicts.
It is possible for the utag to
be wrong in both directions: it can predict hit when the access will
miss, and it can predict miss when the access could have hit. In
either case, a fill request to the L2 cache is initiated and the utag
is updated when L2 responds to the fill request.
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
It is also possible for two different
linear addresses that are NOT aliased to the same physical address to
conflict in the utag, if they have the same linear hash. At a given L1
DC index (11:6), only one cacheline with a given linear hash is
accessible at any time; any cachelines with matching linear hashes are
marked invalid in the utag and are not accessible.
It is possible for the utag to be wrong in both directions
What is the specific scenario behind this sentence in the second paragraph? Under what circumstances will a hit be predicted as a miss, and a miss as a hit?
When the CPU brings data from memory into the cache, it calculates a cache way based on the utag. Does it just put the line there, even if the other cache ways are empty?
Linear aliasing occurs when two different linear addresses are mapped to the same physical address.
How can different linear addresses map to the same physical address?
However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.
What does this sentence mean? My understanding is that the utag is first calculated from the linear address (virtual address) to determine which cache way to use, and then the tag field of the physical address determines whether it is a cache hit. But how is the utag updated? Is it recorded in the cache?
any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.
What does this sentence mean?
How does AMD determine a cache hit or miss? Why are some hits treated as misses? Can someone explain? Many thanks!
The L1 data cache tags contain a linear-address-based microtag (utag)
that tags each cacheline with the linear address that was used to
access the cacheline initially.
Each cache line in the L1D has a utag associated with it. This implies the utag memory structure is organized exactly like the L1D (i.e., 8 ways and 64 sets) and there is a one-to-one correspondence between the entries. The utag is calculated based on the linear address of the request that caused the line to be filled in the L1D.
Loads use this utag to determine which way of the cache to read using
their linear address, which is available before the load's physical
address has been determined via the TLB.
The linear address of a load is sent simultaneously to the way predictor and the TLB (it's better to use the term MMU, since there are multiple TLBs). A particular set in the utag memory is selected using certain bits of the linear address (11:6) and all of the 8 utags in that set are read at the same time. Meanwhile, a utag is calculated based on the linear address of the load request. When both of these operations complete, the given utag is compared against all the utags stored in the set. The utag memory is maintained such that there can be at most one utag in each set with the same value. In case of a hit in the utag memory, the way predictor predicts that the target cache line is in the corresponding cache entry in the L1D. Up until this point, the physical address is not yet needed.
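As a rough mental model of the lookup just described (not AMD's actual hardware): the 64-set by 8-way organization and the use of linear address bits 11:6 are from this answer, while the struct layout and the utag_hash helper are hypothetical; the hash itself is sketched after the next quote.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 64   /* indexed by linear address bits 11:6 */
    #define NUM_WAYS 8

    /* One 8-bit utag per L1D line, mirroring the cache's 64x8 organization. */
    struct utag_memory {
        uint8_t utag[NUM_SETS][NUM_WAYS];
        bool    valid[NUM_SETS][NUM_WAYS];
    };

    uint8_t utag_hash(uint64_t linear_addr);   /* sketched in the next block */

    /* Returns the predicted way, or -1 to predict an L1D miss.
       The utag memory is maintained so that at most one utag per set matches. */
    static int predict_way(const struct utag_memory *um, uint64_t linear_addr)
    {
        unsigned set = (linear_addr >> 6) & (NUM_SETS - 1); /* bits 11:6 */
        uint8_t  tag = utag_hash(linear_addr);

        for (int way = 0; way < NUM_WAYS; way++)
            if (um->valid[set][way] && um->utag[set][way] == tag)
                return way;   /* speculative hit: read only this way */
        return -1;            /* predicted miss */
    }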
The utag is a hash of the load's linear address.
The hash function was reverse-engineered in the paper titled Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors in Section 3 for a number of microarchitectures. Basically, certain bits of the linear address at positions 27:12 are XOR'ed with each other to produce an 8-bit value, which is the utag. A good hash function should: (1) minimize the number of linear address pairs that map to the same utag, (2) minimize the size of the utag, and (3) have a latency not larger than the utag memory access latency.
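For illustration only, a hash with the shape described above (bits 27:12 folded down to 8 bits by XOR) might look like the following. The exact bit pairings differ per microarchitecture and are documented in the Take A Way paper, so this particular folding is an assumption, not the real Zen hash:

    #include <stdint.h>

    /* Illustrative utag hash: XOR the two 8-bit halves of linear address
       bits 27:12. The real Zen hash XORs specific bit pairs (see the
       Take A Way paper); this folding is just a stand-in with the same shape. */
    uint8_t utag_hash(uint64_t linear_addr)
    {
        uint16_t bits_27_12 = (linear_addr >> 12) & 0xFFFF;
        return (uint8_t)(bits_27_12 ^ (bits_27_12 >> 8));
    }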
This linear address based lookup enables a very accurate prediction of
in which way the cacheline is located prior to a read of the cache
data. This allows a load to read just a single cache way, instead of
all 8. This saves power and reduces bank conflicts.
Besides the utag memory and associated logic, the L1D also includes a tag memory and a data memory, all of which have the same organization. The tag memory stores physical tags (bit 6 up to the highest bit of the physical address). The data memory stores cache lines. In case of a hit in the utag, the way predictor reads only one entry in the corresponding way in the tag memory and data memory. The size of a physical address is more than 35 bits on modern x86 processors, and so the size of a physical tag is more than 29 bits. This is more than 3x larger than the size of a utag. Without way prediction, in a cache with more than one cache way, multiple tags would have to be read and compared in parallel. In an 8-way cache, reading and comparing 1 tag consumes much less energy than reading and comparing 8 tags.
In a cache where each way can be activated separately, each cache entry has its own wordline, which is shorter compared to a wordline shared across multiple cache ways. Due to signal propagation delays, reading a single way takes less time than reading 8 ways. However, in a parallel-access cache, there is no way-prediction delay, but linear address translation becomes part of the critical path of the load latency. With way prediction, the data from the predicted entry can be speculatively forwarded to dependent uops. This can provide a significant load latency advantage, especially since linear address translation latency can vary due to the multi-level design of the MMU, even in the typical case of an MMU hit. The downside is that it introduces a new reason why replays may occur: in case of a misprediction, tens or even hundreds of uops may need to be replayed. I don't know if AMD actually forwards the requested data before validating the prediction, but it's possible even though not mentioned in the manual.
Reduction of bank conflicts is another advantage of way prediction as mentioned in the manual. This implies that different ways are placed in different banks. Section 2.6.2.1 says that bits 5:2 of the address, the size of the access, and the cache way number determine the banks to be accessed. This suggests there are 16*8 = 128 banks, one bank for each 4-byte chunk in each way. Bits 5:2 are obtained from the linear address of the load, the size of the load is obtained from the load uop, and the way number is obtained from the way predictor. Section 2.6.2 says that the L1D supports two 16-byte loads and one 16-byte store in the same cycle. This suggests that each bank has a single 16-byte read-write port. Each of the 128 bank ports is connected through an interconnect to each of the 3 ports of the data memory of the L1D. One of the 3 ports is connected to the store buffer and the other two are connected to the load buffer, possibly with intermediary logic for efficiently handling cross-line loads (single load uop but two load requests whose results are merged), overlapping loads (to avoid bank conflicts), and loads that cross bank boundaries.
The fact that way prediction requires accessing only a single way in the tag memory and the data memory of the L1D allows reducing or completely eliminating the need (depending on how snoops are handled) to make the tag and data memories truly multiported (which is the approach Intel has followed in Haswell), while still achieving about the same throughput. Bank conflicts can still occur, though, when there are simultaneous accesses to the same way and identical 5:2 address bits, but different utags. Way prediction does reduce bank conflicts because it doesn't require reading multiple entries (at least in the tag memory, but possibly also in the data memory) for each access, but it doesn't completely eliminate bank conflicts.
That said, the tag memory may require true multiporting to handle fill checks (see later), validation checks (see later), snooping, and "normal path" checks for non-load accesses. I think only load requests use the way predictor. Other types of requests are handled normally.
A highly accurate L1D hit/miss prediction can have other benefits too. If a load is predicted to miss in the L1D, the scheduler wakeup signal for dependent uops can be suppressed to avoid likely replays. In addition, the physical address, as soon as it's available, can be sent early to the L2 cache before fully resolving the prediction. I don't know if these optimizations are employed by AMD.
It is possible for the utag to be wrong in both directions: it can
predict hit when the access will miss, and it can predict miss when
the access could have hit. In either case, a fill request to the L2
cache is initiated and the utag is updated when L2 responds to the
fill request.
On an OS that supports multiple linear address spaces or allows synonyms in the same address space, cache lines can only be identified uniquely using physical addresses. As mentioned earlier, when looking up a utag in the utag memory, there can either be one hit or zero hits. Consider first the hit case. This linear address-based lookup results in a speculative hit and still needs to be verified. Even if paging is disabled, a utag is still not a unique substitute to a full address. As soon as the physical address is provided by the MMU, the prediction can be validated by comparing the physical tag from the predicted way with the tag from the physical address of the access. One of the following cases can occur:
Case 1: The physical tags match and the speculative hit is deemed a true hit. Nothing needs to be done, except possibly triggering a prefetch or updating the replacement state of the line.
Case 2: The physical tags don't match and the target line doesn't exist in any of the other entries of the same set. Note that the target line cannot possibly exist in other sets because all of the L1D memories use the same set indexing function. I'll discuss how this is handled later.
Case 3: The physical tags don't match and the target line does exist in another entry of the same set (associated with a different utag). I'll discuss how this is handled later.
If no matching utag was found in the utag memory, there will be no physical tag to compare against because no way is predicted. One of the following cases can occur:
Case 4: The target line actually doesn't exist in the L1D, so the speculative miss is a true miss. The line has to be fetched from somewhere else.
Case 5: The target line actually exists in the same set but with a different utag. I'll discuss how this is handled later.
(I'm making two simplifications here. First, the load request is assumed to be to cacheable memory. Second, on a speculative or true hit in the L1D, there are no detected errors in the data. I'm trying to stay focused on Section 2.6.2.2.)
Accessing the L2 is needed only in cases 2 and 4 and not in cases 3 and 5 (where the line is actually present in the L1D). The only way to determine which is the case is by comparing the physical tag of the load with the physical tags of all present lines in the same set. This can be done either before or after accessing the L2. Either way, it has to be done to avoid the possibility of having multiple copies of the same line in the L1D. Doing the checks before accessing the L2 improves the latency in cases 3 and 5, but hurts it in cases 2 and 4. Doing the checks after accessing the L2 improves the latency in cases 2 and 4, but hurts it in cases 3 and 5. It's possible to both perform the checks and send a request to the L2 at the same time. But this may waste energy and L2 bandwidth in cases 3 and 5. It seems that AMD decided to do the checks after the line is fetched from the L2 (which is inclusive of the L1 caches).
When the line arrives from the L2, the L1D doesn't have to wait until the line is filled into it before responding with the requested data, so a higher fill latency is tolerable. The physical tags are now compared to determine which of the four remaining cases (2, 3, 4, or 5) has occurred. In case 4, the line is filled in the data memory, tag memory, and utag memory in the way chosen by the replacement policy. In case 2, the requested line replaces the existing line that happened to have the same utag, and the replacement policy is not engaged to choose a way. This happens even if there was a vacant entry in the same set, essentially reducing the effective capacity of the cache. In case 5, the utag can simply be overwritten. Case 3 is a little complicated because it involves an entry with a matching physical tag and a different entry with a matching utag. One of them will have to be invalidated and the other will have to be replaced. A vacant entry can also exist in this case and not be utilized.
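As a conceptual summary of the five cases above (the enum names are mine, not AMD's):

    /* Outcome of validating a way prediction against the physical tag.
       Numbering follows the five cases discussed above. */
    enum utag_outcome {
        CASE1_TRUE_HIT,          /* predicted way, physical tag matches           */
        CASE2_PREDICTED_WRONG,   /* utag matched, line not in the set -> need L2  */
        CASE3_LINE_IN_OTHER_WAY, /* utag matched wrong way, line is elsewhere     */
        CASE4_TRUE_MISS,         /* no utag match, line not in the set -> need L2 */
        CASE5_UTAG_STALE         /* no utag match, but line is in the set         */
    };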
Linear aliasing occurs when two different linear addresses are mapped
to the same physical address. This can cause performance penalties for
loads and stores to the aliased cachelines. A load to an address that
is valid in the L1 DC but under a different linear alias will see an
L1 DC miss, which requires an L2 cache request to be made. The latency
will generally be no larger than that of an L2 cache hit. However, if
multiple aliased loads or stores are in-flight simultaneously, they
each may experience L1 DC misses as they update the utag with a
particular linear address and remove another linear address from being
able to access the cacheline.
This is how case 5 (and case 2 to a lesser extent) can occur. Linear aliasing can occur within the same linear address space and across different address spaces (context switching and hyperthreading effects come into play).
It is also possible for two different linear addresses that are NOT
aliased to the same physical address to conflict in the utag, if they
have the same linear hash. At a given L1 DC index (11:6), only one
cacheline with a given linear hash is accessible at any time; any
cachelines with matching linear hashes are marked invalid in the utag
and are not accessible.
This is how cases 2 and 3 can occur, and they're handled as discussed earlier. This part tells us that the L1D uses a simple set indexing function; the set number is bits 11:6.
I think huge pages make cases 2 and 3 more likely to occur because more than half of the bits used by the utag hash function become part of the page offset rather than page number. Physical memory shared between multiple OS processes makes case 5 more likely.

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept and could not find detailed enough answers that throw some light on how everything actually works out in the hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the translated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the translated physical address is matched against the list of tags to find a candidate.
My question is: where is this check performed?
In the cache?
If not in the cache, where else?
If the check is performed in the cache, is there a side-band connection from the TLB to the cache module to get the translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on how this is actually implemented in general, and on the connection between the cache module and the TLB (MMU) module?
I know this depends on the specific architecture and implementation.
But what is the implementation you know of when there is a VIPT cache?
Thanks.
At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small, fast content-addressable memory (CAM) with (for example) 64 entries, 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical (or the translation is a no-op). This gives VIPT speed with the non-aliasing behaviour of a PIPT cache, as long as L1_size / associativity <= page_size, e.g. 32kiB / 8-way = 4k pages. (See the bit-split sketch after these steps.)
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1; higher-associativity (more ways per set) L3 caches definitely don't do this.)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
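To make the index/tag bit split in the steps above concrete, here is a small sketch using this answer's numbers (32kiB, 8-way, 64-byte lines, 4k pages); the helper names are mine:

    #include <stdint.h>
    #include <stdio.h>

    /* 32 KiB, 8-way, 64-byte lines: 32768 / (8 * 64) = 64 sets.
       offset bits = log2(64) = 6, index bits = log2(64) = 6,
       so index+offset = 12 bits = log2(4 KiB page): the index never
       needs translation, which is what makes VIPT behave like PIPT here. */
    #define LINE_BITS  6
    #define SET_BITS   6

    static unsigned line_offset(uint64_t va) { return va & ((1u << LINE_BITS) - 1); }
    static unsigned set_index(uint64_t va)   { return (va >> LINE_BITS) & ((1u << SET_BITS) - 1); }
    static uint64_t phys_tag(uint64_t pa)    { return pa >> (LINE_BITS + SET_BITS); }

    int main(void)
    {
        uint64_t va = 0x7ffd1234abcdULL;          /* arbitrary example address */
        printf("offset=%u set=%u (both from untranslated bits 11:0)\n",
               line_offset(va), set_index(va));
        printf("tag comes from the physical address: 0x%llx\n",
               (unsigned long long)phys_tag(0x1a2b3c000ULL));
        return 0;
    }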
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.

Size of neighbouring data a modern computer caches for locality favour

I have a contiguous block of memory holding 1024 buffers, each buffer 2K bytes in size. I use a linked list to keep a record of the available buffers (a buffer here can be thought of as being used by a Producer and a Consumer). After some operations, the order of the buffers in the linked list becomes random.
Modern computer architecture favours compact data and locality a lot. It caches neighbouring data when a location needs to be accessed. The cache line of my computer is 64 bytes (corrected from 64K).
Question 1. In my case, are there a lot of cache misses because my access pattern is random?
Question 2. What is the size of neighbouring data a modern computer caches? I think if you access a location in an array of integers, it will cache the neighbouring integers. But my unit of data (2K) is much larger than an int (4 bytes), so I am not sure how many neighbours will be cached.
First of all, I doubt that "the cache-line of my computer is 64K bytes"; it's most likely only 64 bytes. Let me try to answer your questions:
Question 1. In my case, are there a lot of cache misses because my access pattern is random?
Not necessarily. It depends on how many operations you do on a buffer once it is cached.
If you cache a 2K buffer and do lots of sequential work on it, your cache hit rate will be good. As Paul suggested, this works even better with hardware prefetching enabled.
However, if you constantly jump between buffers and do a relatively small amount of work on each buffer, the cache hit rate will drop.
However, 1024 x 2KB = 2MB, so that could be the size of an L2 cache (if you also have an L3, the L2 is generally smaller). So even if you miss L1, there's a high chance that in both cases you will hit L2.
Question 2. What is the size of neighbouring data a modern computer caches?
Usually the number of neighbors fetched is given by the cache line size. If the line size is 64B, you could fetch 16 integer values. So on each read, you fill a cache line. However you need to take into consideration prefetching. If your CPU detects that the memory reads are sequential, it will prefetch more neighbors and bring more cache lines in advance.
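As a rough sketch of the arithmetic behind this answer (64-byte lines, with the 2K buffer size and 1024-buffer count taken from the question):

    #include <stdio.h>

    int main(void)
    {
        const unsigned line_size   = 64;    /* bytes per cache line         */
        const unsigned buffer_size = 2048;  /* one buffer from the question */
        const unsigned num_buffers = 1024;

        /* Jumping to a random buffer misses on its first line, but the 32
           lines inside the buffer are then filled on demand (or prefetched)
           as you work through it sequentially. */
        printf("lines per buffer : %u\n", buffer_size / line_size);              /* 32        */
        printf("ints per line    : %u\n", line_size / (unsigned)sizeof(int));    /* 16        */
        printf("total working set: %u KB\n", buffer_size * num_buffers / 1024);  /* 2048 KB   */
        return 0;
    }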
Hope this helps!

How is an LRU cache implemented in a CPU?

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?
Also what happens in a dual core system where both CPUs write to the one address simultaneously?
For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.
For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
For a 4-way cache, the following encoding of the state would seem to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating whether the higher or lower numbered way (of the remaining two) was more recently used.
On a MRU hit, the state is unchanged.
On a next-MRU hit the two bit fields are swapped.
On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the next-MRU way number is higher or lower than the less recently used way that did not hit.
On a miss, the state is updated as if an LRU hit had occurred.
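Here is one possible sketch of that 5-bit scheme in C, assuming the convention that the final bit is 1 when the higher-numbered of the two remaining ways is the more recently used; the names and exact bit convention are mine:

    #include <stdint.h>

    /* 5-bit true-LRU state for a 4-way set: 2 bits MRU way, 2 bits next-MRU
       way, 1 bit saying whether the higher-numbered of the remaining two
       ways is the more recently used of that pair. */
    struct lru4 {
        uint8_t mru;      /* 2 bits */
        uint8_t next_mru; /* 2 bits */
        uint8_t hi_newer; /* 1 bit  */
    };

    /* The two ways not named by mru/next_mru, in ascending order. */
    static void remaining(const struct lru4 *s, uint8_t *lo, uint8_t *hi)
    {
        uint8_t r[2], n = 0;
        for (uint8_t w = 0; w < 4; w++)
            if (w != s->mru && w != s->next_mru)
                r[n++] = w;
        *lo = r[0];
        *hi = r[1];
    }

    /* Way to evict: the less recently used of the remaining pair. */
    static uint8_t lru4_victim(const struct lru4 *s)
    {
        uint8_t lo, hi;
        remaining(s, &lo, &hi);
        return s->hi_newer ? lo : hi;
    }

    /* Update on a hit (a miss fills lru4_victim() and then calls this). */
    static void lru4_touch(struct lru4 *s, uint8_t way)
    {
        if (way == s->mru)
            return;                   /* MRU hit: state unchanged       */
        if (way == s->next_mru) {     /* next-MRU hit: swap the fields  */
            s->next_mru = s->mru;
            s->mru = way;
            return;
        }
        /* Hit on one of the remaining pair: old MRU becomes next-MRU, and the
           bit records that the old next-MRU is newer than the untouched way. */
        uint8_t lo, hi;
        remaining(s, &lo, &hi);
        uint8_t other = (way == lo) ? hi : lo;   /* the way that did not hit */
        s->hi_newer = (s->next_mru > other);
        s->next_mru = s->mru;
        s->mru = way;
    }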
Because LRU tracking has such overhead, simpler mechanisms like binary-tree pseudo-LRU are often used. On a hit, such a scheme just updates each branch point of the tree with which half of the associated ways the hit was in. For a power-of-two number of ways W, a binary-tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) is less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) is less recently used, and set the higher bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
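And a minimal sketch of binary-tree pseudo-LRU for an 8-way set (7 bits per set), following the bit polarity described above (a bit value of 0 means the lower half under that node is less recently used); the code layout is mine:

    #include <stdint.h>

    #define WAYS 8                 /* power of two */
    #define TREE_BITS (WAYS - 1)   /* 7 state bits per set */

    /* bit[i] == 0: the lower half under node i is less recently used
       bit[i] == 1: the upper half under node i is less recently used
       Nodes are stored as a heap: children of node i are 2i+1 and 2i+2. */
    struct plru { uint8_t bit[TREE_BITS]; };

    /* On a hit (or fill) in `way`, point every node on the path away from it. */
    static void plru_touch(struct plru *s, unsigned way)
    {
        unsigned node = 0, low = 0, size = WAYS;
        while (size > 1) {
            unsigned half = size / 2;
            if (way < low + half) {          /* hit in lower half           */
                s->bit[node] = 1;            /* ...so upper half is now LRU */
                node = 2 * node + 1;
            } else {                         /* hit in upper half           */
                s->bit[node] = 0;            /* ...so lower half is now LRU */
                node = 2 * node + 2;
                low += half;
            }
            size = half;
        }
    }

    /* Follow the LRU pointers down the tree to pick a victim way. */
    static unsigned plru_victim(const struct plru *s)
    {
        unsigned node = 0, low = 0, size = WAYS;
        while (size > 1) {
            unsigned half = size / 2;
            if (s->bit[node] == 0) {         /* lower half is LRU */
                node = 2 * node + 1;
            } else {                         /* upper half is LRU */
                node = 2 * node + 2;
                low += half;
            }
            size = half;
        }
        return low;
    }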
For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.
With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race, and one core will get exclusive access. If only one of the writing cores already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present, a write would typically need to request the cache line of data as well as asking for exclusive state.
Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
There is a good slide deck, Page replacement algorithms, that talks about various page replacement schemes. It also explains the LRU implementation using an m×m matrix really well.

optimal memory layout for read-only/write memory segments

Suppose I have two memory segments (of equal size, approximately 1 KB each); one is read-only (after initialization) and the other is read/write.
What is the best layout in memory for such segments in terms of memory performance: one allocation with contiguous segments, or two allocations (in general not contiguous)? My primary architecture is Linux on Intel 64-bit.
My feeling is that the former (cache-friendlier) case is better.
Are there circumstances where the second layout is preferred?
I would put the 2KB of data in the middle of a 4KB page, to avoid interference from reads and writes close to the page boundary. Similarly, keeping the write data separate is also a good idea for the same reason.
Having contiguous read/write blocks may be less efficient than keeping them separate. For example, a cache that is storing data for code interested in just the read-only portion may be invalidated by a write from another CPU. The cache line will be invalidated and refreshed, even though the code wasn't reading the writable data. By keeping the blocks separate, you avoid this case, and writes to the writable data block only invalidate cache lines for the writable block and do not interfere with cache lines for the read-only block.
Note that this is only a concern at the block boundary between the readable and writable blocks. If your block sizes were much larger than the cache line size, then this would be a peripheral problem, but as your blocks are small, requiring just a few cache lines, then the problem of invalidating lines could be significant.
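A hedged sketch of the keep-them-separate suggestion above on Linux/C11: the read-only and read/write ~1 KB segments go in separate page-aligned allocations, so writes never dirty cache lines belonging to the read-only block. aligned_alloc is standard C11; the segment size is from the question, and the struct and function names are mine:

    #include <stdlib.h>
    #include <string.h>

    #define SEG_SIZE  1024   /* ~1 KB per segment, as in the question */
    #define PAGE_SIZE 4096

    struct segments {
        unsigned char *ro;   /* read-only after initialization */
        unsigned char *rw;   /* read/write                     */
    };

    /* Two separate page-aligned allocations: the read-only block and the
       writable block never share a cache line (or a page), so writes by
       another core cannot invalidate lines of the read-only block. */
    static int segments_init(struct segments *s)
    {
        s->ro = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        s->rw = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        if (!s->ro || !s->rw) {
            free(s->ro);
            free(s->rw);
            return -1;
        }
        memset(s->ro, 0, SEG_SIZE);   /* initialize, then treat as read-only */
        memset(s->rw, 0, SEG_SIZE);
        return 0;
    }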
With data that small, it really shouldn't matter much. Both of those arrays will fit into any level of cache just fine.
It'll depend on what you're doing with the memory. I'm fairly certain that contiguous (and page aligned!) would never be slower than two randomly placed segments, but it won't necessarily be any faster.
Given that it's an Intel processor, you probably only need to ensure that the addresses are not exactly a multiple of 64k apart. If they are, loads from either section that map to the same modulo 64k address will collide in L1 and cause an L1 miss. There's also a 4MB aliasing issue, but I'd be surprised if you ran into that.
