Index and tag switching position - caching

Let us consider the alternative {Index, Tag, Offset}. The usage and the size of each field remain the same, e.g. the index is still used to locate a block in the cache, and its bit length is still determined by the number of cache blocks. The only difference is that we now use the most significant bits for the index, the middle portion for the tag, and the last portion for the offset.
What do you think is the shortcoming of this scheme?

This will work, and if the cache is fully associative it won't matter (there is no index, it's all tag), but if the associativity is limited it will make (far) less effective use of the cache memory.  Why?
Consider an object that is large enough to cross a cache block boundary.
When accessing the object, the addresses of some of its fields will not be in the same cache block as the others.  How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (most significant bits), the tag will have changed between these two addresses, but the index will be the same. Thus there is a collision at that index, which uses up one of the ways of the set associativity.  If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated accesses to the same object.
Let's pretend that we have a 12-bit address space, and that the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields located at 0x248, so that two integer fields, a and b, are at 0x248 and 0x24c, and the two other integer fields, c and d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high-order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in a direct-mapped cache both the a & b fields and the c & d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change: it stays at 2.  Thus, on a 1-way set associative cache, accessing a or b followed by c or d will evict the a & b fields, which results in a miss if/when a or b is accessed later.
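Here is a minimal C sketch, assuming the 4-bit field split above, that prints the fields under both layouts; the code is just an illustration of the arithmetic:

#include <stdio.h>
#include <stdint.h>

/* 12-bit address split into three 4-bit fields, as in the example above */
static void decompose(uint16_t addr)
{
    unsigned high = (addr >> 8) & 0xF;   /* most significant hex digit  */
    unsigned mid  = (addr >> 4) & 0xF;   /* middle hex digit            */
    unsigned low  =  addr       & 0xF;   /* least significant hex digit */

    /* usual layout {tag, index, offset}: the index is the middle digit */
    printf("addr 0x%03X  tag-first:   tag=%X index=%X offset=%X\n",
           (unsigned)addr, high, mid, low);
    /* alternative layout {index, tag, offset}: the index is the high digit */
    printf("             index-first: index=%X tag=%X offset=%X\n",
           high, mid, low);
}

int main(void)
{
    decompose(0x248);   /* fields a and b */
    decompose(0x250);   /* fields c and d: only the tag-first index changes (4 -> 5) */
    return 0;
}

For 0x248 and 0x250 the index-first layout keeps index 2 for both addresses, while the usual layout moves from set 4 to set 5.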
So it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as something it accessed before.  This happens as we manipulate individual objects, as we allocate objects that end up being adjacent, and as we repeat accesses over an array (e.g. a second loop over the array).
If the index is in the middle, we get more variation as we use different addresses within some block or chunk or area of memory: in our 12-bit address space example, the index changes every 16 bytes, so adjacent 16-byte blocks can be stored in the cache at the same time.
But if the index is at the beginning, we need to consume more memory before we get to a different index: the index changes only every 256 bytes, so two adjacent 16-byte blocks will often collide.
Our programs and compilers are generally written assuming that locality is favored by the cache, and this means the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.

Related

Cache memory: What is the difference between a tag and an index?

I read a lot of articles and watched videos explaining the concept of cache memory but I still can't get what the difference is between an index and a tag in the address format. They all say that we need an index because otherwise multiple locations would get hashed to the same location within the cache. But I don't understand. Can someone please explain?
Thank you.
An address is a simple number, usually taken as an unsigned integer:
+-----------------------------+
|           address           |
+-----------------------------+
The same address, the same overall number, is decomposed by the cache into parts called fields:
+-----------------------------+
|   tag   |  index  | offset  |
+-----------------------------+
For any given cache, the tag width, index width, and offset width are fixed numbers of bits, so for any given address each field has a value we can determine, provided we know the address and the field widths of that cache.
Caches store copies of main memory in chunks called blocks.  To find the block address of some address, keep its tag and index bits as is, but set the block offset bits to zeros.
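In code, assuming a hypothetical offset width, that masking looks like this:

#include <stdint.h>

#define OFFSET_BITS 6   /* hypothetical: 64-byte blocks */

/* keep the tag and index bits, zero the block-offset bits */
static inline uint64_t block_address(uint64_t addr)
{
    return addr & ~(((uint64_t)1 << OFFSET_BITS) - 1);
}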
Let's say there are two addresses, A and B; each has a tag, an index, and an offset.
We want to know if A and B match to the level of the memory block, which means we care about the tag and index bits matching, but not about the offset bits.
You can see from the above that two addresses can be different yet have the same index; many addresses share the same index while having different tag or offset bits.
Now, let's say that B is an address known to be cached.  That means that the block of memory for B's tag and B's index is in the cache.  The whole block is in the cache, which covers all addresses with the same tag and index, for any possible offset bits.
Let's say that A is some address the program wants to access.  The cache's job is to determine whether A and B refer to the same block of memory.  If they do, then since B's block is in the cache, the access to A is a hit; if they don't, the access to A is a miss.
Caches employ the notion of an array, and they use the index field to select an element of that array, which simplifies their operation.  A simple (direct-mapped) cache stores one block at each index position in the array (other caches store more than one block at each index position: this is the cache's set associativity, the number of "ways", as in 2-way or 4-way, etc.).  To find a desired address, A, we look in the cache by using A's index value as the index into the cache array.  If the element stored there holds a block for some address B, and B's stored tag is the same value as A's tag, then both the index and the tag match: the index matches because we looked in the right place, and the tags match because the cache stores B's tag and we have all of A, so we can compare A's tag against it.
Such a cache will never store the block for an address at an index position different than the index position value for its address.  So, there is only one index position to look at to see if the cache stores the block associated with an address, A.
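A rough C model of that direct-mapped lookup, with hypothetical field widths chosen just to make it concrete:

#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6                      /* hypothetical 64-byte blocks */
#define INDEX_BITS  8                      /* hypothetical 256-set cache  */
#define NUM_SETS    (1u << INDEX_BITS)

struct dm_cache {
    bool     valid[NUM_SETS];              /* does this slot hold a block?  */
    uint64_t tag[NUM_SETS];                /* tag of the block stored there */
};

/* hit only if the one slot selected by A's index holds a block with A's tag */
static bool dm_lookup(const struct dm_cache *c, uint64_t addr)
{
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);
    return c->valid[index] && c->tag[index] == tag;
}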
They all say that we need an index because otherwise multiple locations would get hashed to the same location within the cache
There is a degenerate case in cache architecture where the index is 0 bits wide.  This means that regardless of the actual address, A, all addresses are stored at the same single index position.  Such a cache is "fully associative" and does not use the index field (or, equivalently, the index field has zero width).
The benefit of the index (when present in the cache architecture) is a simplification of the hardware: it has to only look at blocks stored at the index of A, and never at blocks stored at other indexes within the cache.
The benefit of not using an index is that one address will never evict another merely due to having the same index; fully associative caches are subject to less cache thrashing.
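For contrast, a fully associative lookup has no index to narrow the search; a sketch with an invented way count:

#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6     /* hypothetical 64-byte blocks      */
#define NUM_WAYS    8     /* hypothetical; one set, many ways */

struct fa_cache {
    bool     valid[NUM_WAYS];
    uint64_t tag[NUM_WAYS];   /* here the "tag" is everything above the offset */
};

/* no index field: every way of the single set is a candidate */
static bool fa_lookup(const struct fa_cache *c, uint64_t addr)
{
    uint64_t tag = addr >> OFFSET_BITS;
    for (unsigned w = 0; w < NUM_WAYS; ++w)
        if (c->valid[w] && c->tag[w] == tag)
            return true;
    return false;
}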

Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical

This question comes in context of a section on virtual memory in an undergraduate computer architecture course. Neither the teaching assistants nor the professor were able to answer it sufficiently, and online resources are limited.
Question:
Suppose a processor with the following specifications:
8KB pages
32-bit virtual addresses
28-bit physical addresses
a two-level page table, with a 1KB page table at the first level and 8KB page tables at the second level
4-byte page table entries
a 16-entry 8-way set associative TLB
in addition to the physical frame (page) number, page table entries contain a valid bit, a readable bit, a writeable bit, an executable bit, and a kernel-only bit.
Now suppose this processor has a 32KB L1 cache whose tags are computed based on physical addresses. What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
Intuition:
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page, thus providing a small speed-up. However, I am unsure if this is the correct intuition and definitely don't know how to follow through with it. Could someone please explain this?
Note: I have computed the number of page table entries to be 2^19, if that helps anyone.
What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
They only specified that the cache is physically tagged.
You can always build a virtually indexed cache, no minimum associativity. Even direct-mapped (1 way per set) works. See Cache Addressing Methods Confusion for details on VIPT vs. PIPT (and VIVT, and even the unusual PIVT).
For this question not to be trivial, I assume they also meant "without creating aliasing problems", so VIPT is just a speedup over PIPT (physically indexed, physically tagged). You get the benefit of allowing TLB lookup in parallel with fetching tags (and data) for the ways of the indexed set, without any downsides.
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page
You need the physical address to check against the tags; remember your cache is physically tagged. (Virtually tagged caches do exist, but typically have to get flushed on context switches to a process with different page tables = different virtual address space. This used to be used for small L1 caches on old CPUs.)
Having both numbers be a power of 2 is normally assumed, so they're always evenly divisible.
Page sizes are always a power of 2 so you can split an address into page number and offset-within-page by just taking different ranges of bits in the address.
Small/fast cache sizes also always have a power of 2 number of sets so the index "function" is just taking a range of bits from the address. For a virtually-indexed cache: from the virtual address. For a physically-indexed cache: from the physical address. (Outer caches like a big shared L3 cache may have a fancier indexing function, like a hash of more address bits, to avoid aliasing for addresses offset from each other by a large power of 2.)
The cache size might not be a power of 2, but you'd get that by having a non-power-of-2 associativity (e.g. 10 or 12 ways is not rare) rather than a non-power-of-2 line size or number of sets. After indexing a set, the cache fetches the tags for all the ways of that set and compares them in parallel. (Fast L1 caches often fetch the data selected by the line-offset bits in parallel, too; the comparators then just mux that data into the output, or raise a flag for no match.)
Requirements for VIPT without aliasing (like PIPT)
For that case, you need all index bits to come from below the page offset. They translate "for free" from virtual to physical so a VIPT cache (that indexes a set before TLB lookup) has no homonym/synonym problems. Other than performance, it's PIPT.
My detailed answer on Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? includes a section on that speed hack.
Virtually indexed physically tagged cache Synonym shows a case where the cache does not have that property, and needs page coloring by the OS to avoid synonym problems.
How to compute cache bit widths for tags, indices and offsets in a set-associative cache and TLB has some more notes about cache size / associativity that give that property.
Formula:
min associativity = cache size / page size
e.g. a system with 8kiB pages needs a 32kiB L1 cache to be at least 4-way associative so that index bits only come from the low 13.
A direct-mapped cache (1 way per set) can only be as large as 1 page: byte-within-line and index bits total up to the byte-within-page offset. Every byte within a direct-mapped (1-way) cache must have a unique index:offset address, and those bits come from contiguous low bits of the full address.
To put it another way, 2^(idx_bits + within_line_bits) is the total cache size with only one way per set. 2^N is the page size, for a page offset of N (the number of byte-within-page address bits that translate for free).
The actual number of sets (in this case = lines) depends on the line size and page size. Using smaller / larger lines would just shift the divide between offset and index bits.
From there, the only way to make the cache bigger without indexing from higher address bits is to add more ways per set, not more sets.
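A small sanity check of that formula, using the question's 32KB cache and 8KB pages (the 64-byte line size and the GCC/Clang __builtin_ctz helper are my own assumptions):

#include <stdio.h>

int main(void)
{
    unsigned cache_size  = 32 * 1024;   /* from the question             */
    unsigned page_size   =  8 * 1024;   /* from the question             */
    unsigned line_size   = 64;          /* assumed; any power of 2 works */

    unsigned min_ways    = cache_size / page_size;            /* = 4     */
    unsigned sets        = cache_size / (line_size * min_ways);

    /* index + offset bits must fit inside the page offset (13 bits here) */
    unsigned offset_bits = __builtin_ctz(line_size);
    unsigned index_bits  = __builtin_ctz(sets);
    unsigned page_bits   = __builtin_ctz(page_size);

    printf("min associativity = %u-way, %u sets, index+offset = %u bits "
           "(fits in the %u-bit page offset)\n",
           min_ways, sets, index_bits + offset_bits, page_bits);
    return 0;
}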

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64 byte cacheline, it can hold 16 array elements from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
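Conceptually, the rotate-and-cut step behaves like the following software model (a 64-byte line and a little-endian host are assumed here; real hardware does this with shifters and muxes, not memcpy):

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

/* model: the cache hands back the whole line; a later stage picks
   `size` bytes starting at `offset` and zero-extends them */
static uint64_t extract(const uint8_t line[LINE_SIZE],
                        unsigned offset, unsigned size)
{
    uint64_t value = 0;
    memcpy(&value, line + offset, size);   /* requires size <= 8, offset + size <= 64 */
    return value;                          /* little-endian host assumed */
}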
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focused on power saving may implement a lower granularity, but this often just adds complexity, since you may have to deal with more cases of a line being split into segments.

Understanding caches and block sizes

A quick question to make sure I understand the concept behind a "block" and its usage with caches.
Say I have a small cache that holds 4 blocks of 4 words each, and let's say it's also direct mapped. If I try to access a word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache, or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is triggered, does the CPU load one block's worth of data (4 words) starting at the accessed address, or does it calculate which block that word of memory is in and bring in that block instead?
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, i.e. accessed by using some portion of the requested address (a "lookup table key", if you will). If the cache used a block size of 1 word, the entire address -- all N bits of it -- would be the "key", and each word would be accessible with that granularity.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
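Applied to the question above (4-word blocks; word-addressed memory assumed), the aligned fetch means word 2 arrives as part of the block holding words 0-3:

#include <stdio.h>

#define WORDS_PER_BLOCK 4   /* from the question: 4-word blocks */

int main(void)
{
    unsigned word_addr  = 2;                                   /* the accessed word   */
    unsigned block_base = word_addr & ~(WORDS_PER_BLOCK - 1u); /* aligned block start */
    printf("word %u is fetched as part of words %u-%u\n",
           word_addr, block_base, block_base + WORDS_PER_BLOCK - 1);
    return 0;   /* prints: word 2 is fetched as part of words 0-3 */
}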
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accommodate that situation, the cache fill can take place starting with the "critical cache address" -- the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward -- usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
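A sketch of that modulo-N fill order for the byte-126 example, assuming 4-byte words in the 128-byte line:

#include <stdio.h>

#define WORDS_PER_LINE 32   /* 128-byte line of 4-byte words (word size assumed) */

int main(void)
{
    unsigned critical = 126 / 4;   /* the word holding byte 126 = word 31 */
    /* the fill delivers the critical word first, then wraps around the line */
    for (unsigned i = 0; i < WORDS_PER_LINE; ++i)
        printf("%u ", (critical + i) % WORDS_PER_LINE);
    printf("\n");   /* prints 31 0 1 2 ... 30 */
    return 0;
}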
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.

Cache Memory Blocks Organization

I am not able to understand how exactly the cache is organized in the following scenario.
The cache size is 256 bytes and the cache line size is 8 bytes. All variables are 4 bytes. Assume that an array A[1024] is stored in memory locations 0-4095. If we are using a fully associative mapping technique, how is the array mapped to this particular cache? Consider that the cache is initially empty and that we use the LRU algorithm for replacement. During each replacement, an entire cache line is replaced.
Initial analysis :
There will be 32 cache blocks, each 8 bytes long, but the variables to be stored in them are only 4 bytes long. I am not able to take this analysis any further as to how the array elements are mapped to the 32 cache blocks.
Let's assume it's accessed sequentially:
for (int i=0; i<1024; ++i)
read(A[i]);
In that case, you'll fill the first 64 elements (A[0] through A[63]) into the 32 cache blocks in adjacent pairs like MSalters said.
The next access, which (since you access the array in sequential order) is to A[64], would have to kick out the least recently used line. With LRU the victim is the first block filled (way 0), so A[64] and A[65] replace A[0] and A[1], and so on: in general, element i ends up in way floor(i/2) % 32.
Computing the hit rate requires an additional assumption: each memory fetch fills a full block (8 bytes), since you can't fill half blocks (actually there are ways to do so using mask bits, but let's assume the simple case). We therefore get every second element "for free": fetching A[0] also fetches A[1], and so on. In theory this means the hit rate could be 50% (miss on even elements, hit on odd ones; in reality most CPUs would perform the accesses in parallel, so you won't really see that hit rate, but let's say the accesses are serialized here).
Note that each new block fetched after the first 64 elements has to evict a block from the cache; if processing the elements also modifies them, you'll have to write those blocks back too.
Elements A[0] and A[1] are stored in adjacent memory locations, bytes 0-3 and 4-7. That means they share the first cache block. The other elements are similarly mapped pairwise to a cache line. Which pair goes where?
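A quick LRU simulation under the stated assumptions (32 fully associative 8-byte lines, 4-byte elements, serialized sequential accesses) reproduces both the floor(i/2) % 32 mapping and the 50% hit rate; this is only a software model of the scenario above:

#include <stdio.h>

#define NUM_WAYS       32    /* 256-byte cache / 8-byte lines */
#define ELEMS_PER_LINE  2    /* 8-byte line / 4-byte int      */
#define NUM_ELEMS    1024

int main(void)
{
    int block_in_way[NUM_WAYS];  /* which memory block each way holds */
    int age[NUM_WAYS];           /* larger = less recently used       */
    int hits = 0;

    for (int w = 0; w < NUM_WAYS; ++w) { block_in_way[w] = -1; age[w] = 0; }

    for (int i = 0; i < NUM_ELEMS; ++i) {
        int block = i / ELEMS_PER_LINE;       /* memory block holding A[i] */
        int found = -1, victim = 0;

        for (int w = 0; w < NUM_WAYS; ++w) age[w]++;        /* everyone ages */
        for (int w = 0; w < NUM_WAYS; ++w) {
            if (block_in_way[w] == block) found = w;
            if (age[w] > age[victim])     victim = w;       /* oldest = LRU  */
        }
        if (found >= 0) { hits++; age[found] = 0; }
        else            { block_in_way[victim] = block; age[victim] = 0; }
    }
    printf("hits: %d / %d accesses (%.0f%%)\n",
           hits, NUM_ELEMS, 100.0 * hits / NUM_ELEMS);   /* 512 / 1024 = 50% */
    return 0;
}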

Resources