How does Direct mapped cache implement spatial locality? - caching

An array is declared as floatA[2048] . Each array element is 4Bytes in size.This program is run on a computer that has a
direct mapped data cache of size 8Kbytes, with block (line) size of 16Bytes.
Which elements of the array conflict with element A[0] in the data cache?
Ultimately A[0],A[512],A[1024],A[1536] map to cache block 0
As per my understanding, when A[0] is required for the first time, A[0],A[1],A[2],A[3](since one cache block can hold 4 elements) are brought into the cache and placed in cache blocks 0, 1, 2,and 3 respectively.
Other approach would be to bring only A[0] and place it in cache block 0. (Spatial locality not used here)
What is the general practice in such a scenario?

All four elements A[3:0] are stored in cache block 0 - since these 4 elements together form 16B. Depending on how the hardware system is set up, the next 16B are then stored on cache block 1 (the decision of which cache line (16B contiguous data granule) maps into which set is made while designing hardware and is based on certain bits of the address.

Related

index and tag switching position

Let us consider the alternative {Index, Tag, Offset}. The usage and the size of each of the field remain the same, e.g. index is used to locate a block in cache, and its bit-length is still determined by the number of cache blocks. The only difference is that we now uses the MSB bits for index, the middle portion for tag, and the last portion for offset.
What do you think is the shortcoming of this scheme?
This will work — and if the cache is fully associative this won't matter (as there is no index, its all tag), but if the associativity is limited it will make (far) less effective use of the cache memory.  Why?
Consider an object, that is sufficiently large to cross a cache block boundary.
When accessing the object, the address of some fields vs. the other fields will not be in the same cache block.  How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (most significant bytes), the tag will have changed between these two addresses, but the index will be the same — thus, there will be an collision at the index, which will use up one of the ways of the set-associativity.  If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated access to the same object.
Let's pretend that we have 12-bit address space, and the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields, and that the object is at location 0x248 so that two integer fields, a, b, are at 0x248 and 0x24c and two other integer fields, c, d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in an direct mapped cache both the a&b fields and the c&d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change — it stays at 2.  Thus, on a 1-way set associative cache, accessing fields a or b followed by c or d will evict the a&b fields, which will result in a miss if/when a or b are accessed later.
So, it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as it accessed before.  This happens as we manipulate individual objects, and as we allocate objects that end up being adjacent, and as we repeat accesses to an array (e.g. 2nd loop over an array).
If the index is in the middle, we get more variation as we use different addresses of within some block or chunk or area of memory — in our 12-bit address space example, the index changes every 16 bytes, and adjacent block of 16 bytes can be stored in the cache.
But if the index is at the beginning we need to consume more memory before we get to a different index — the index changes only every 256 bytes, so two adjacent 16-byte blocks will often have collisions.
Our programs and compilers are generally written assuming locality is favored by the cache — and this means that the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.

Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical

This question comes in context of a section on virtual memory in an undergraduate computer architecture course. Neither the teaching assistants nor the professor were able to answer it sufficiently, and online resources are limited.
Question:
Suppose a processor with the following specifications:
8KB pages
32-bit virtual addresses
28-bit physical addresses
a two-level page table, with a 1KB page table at the first level, and 8KB page tables at the
second level
4-byte page table entries
a 16-entry 8-way set associative TLB
in addition to the physical frame (page) number, page table entries contain a valid bit, a
readable bit, a writeable bit, an executable bit, and a kernel-only bit.
Now suppose this processor has a 32KB L1 cache whose tags are computed based on physical addresses. What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
Intuition:
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page, thus providing a small speed-up. However, I am unsure if this is the correct intuition and definitely don't know how to follow through with it. Could someone please explain this?
Note: I have computed the number of page table entries to be 2^19, if that helps anyone.
What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
They're only specified that the cache is physically tagged.
You can always build a virtually indexed cache, no minimum associativity. Even direct-mapped (1 way per set) works. See Cache Addressing Methods Confusion for details on VIPT vs. PIPT (and VIVT, and even the unusual PIVT).
For this question not to be trivial, I assume they also meant "without creating aliasing problems", so VIPT is just a speedup over PIPT (physically indexed, phyiscally tagged). You get the benefit of allowing TLB lookup in parallel with fetching tags (and data) for the ways of the indexed set without any downsides.
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page
You need the physical address to check against the tags; remember your cache is physically tagged. (Virtually tagged caches do exist, but typically have to get flushed on context switches to a process with different page tables = different virtual address space. This used to be used for small L1 caches on old CPUs.)
Having both numbers be a power of 2 is normally assumed, so they're always evenly divisible.
Page sizes are always a power of 2 so you can split an address into page number and offset-within-page by just taking different ranges of bits in the address.
Small/fast cache sizes also always have a power of 2 number of sets so the index "function" is just taking a range of bits from the address. For a virtually-indexed cache: from the virtual address. For a physically-indexed cache: from the physical address. (Outer caches like a big shared L3 cache may have a fancier indexing function, like a hash of more address bits, to avoid aliasing for addresses offset from each other by a large power of 2.)
The cache size might not be a power of 2, but you'd do that by having a non-power-of-2 associativity (e.g. 10 or 12 ways is not rare) rather than a non-power-of-2 line size or number of sets. After indexing a set, the cache fetches the tags for all the ways of that set and compare them in parallel. (And for fast L1 caches, often fetch the data selected by the line-offset bits in parallel, too, then the comparators just mux that data into the output, or raise a flag for no match.)
Requirements for VIPT without aliasing (like PIPT)
For that case, you need all index bits to come from below the page offset. They translate "for free" from virtual to physical so a VIPT cache (that indexes a set before TLB lookup) has no homonym/synonym problems. Other than performance, it's PIPT.
My detailed answer on Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? includes a section on that speed hack.
Virtually indexed physically tagged cache Synonym shows a case where the cache does not have that property, and needs page coloring by the OS to let avoid synonym problems.
How to compute cache bit widths for tags, indices and offsets in a set-associative cache and TLB has some more notes about cache size / associativity that give that property.
Formula:
min associativity = cache size / page size
e.g. a system with 8kiB pages needs a 32kiB L1 cache to be at least 4-way associative so that index bits only come from the low 13.
A direct-mapped cache (1 way per set) can only be as large as 1 page: byte-within-line and index bits total up to the byte-within-page offset. Every byte within a direct-mapped (1-way) cache must have a unique index:offset address, and those bits come from contiguous low bits of the full address.
To put it another way, 2^(idx_bits + within_line_bits) is the total cache size with only one way per set. 2^N is the page size, for a page offset of N (the number of byte-within-page address bits that translate for free).
The actual number of sets (in this case = lines) depends on the line size and page size. Using smaller / larger lines would just shift the divide between offset and index bits.
From there, the only way to make the cache bigger without indexing from higher address bits is to add more ways per set, not more ways.

Difference between cache way and cache set

I am trying to learn some stuff about caches. Lets say I have a 4 way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4 cache lines.
How many cache ways do I have? I am not even sure what a cache way is. Can someone explain that? I have done some searching, the best example was
http://download.intel.com/design/intarch/papers/cache6.pdf
But I am still confused.
Thanks.
The cache you are referring to is known as set associative cache. The whole cache is divided into sets and each set contains 4 cache lines(hence 4 way cache). So the relationship stands like this :
cache size = number of sets in cache * number of cache lines in each set * cache line size
Your cache size is 32KB, it is 4 way and cache line size is 32B. So the number of sets is
(32KB / (4 * 32B)) = 256
If we think of the main memory as consisting of cache lines, then each memory region of one cache line size is called a block. So each block of main memory will be mapped to a cache line (but not always to a particular cache line, as it is set associative cache).
In set associative cache, each memory block will be mapped to a fixed set in the cache. But it can be stored in any of the cache lines of the set. In your example, each memory block can be stored in any of the 4 cache lines of a set.
Memory block to cache line mapping
Number of blocks in main memory = (1GB / 32B) = 2^25
Number of blocks in each page = (4KB / 32B) = 128
Each byte address in the system can be divided into 3 parts:
Rightmost bits represent byte offset within a cache line or block
Middle bits represent to which cache set this byte(or cache line) will be mapped
Leftmost bits represent tag value
Bits needed to represent 1GB of memory = 30 (1GB = (2^30)B)
Bits needed to represent offset in cache line = 5 (32B = (2^5)B)
Bits needed to represent 256 cache sets = 8 (2^8 = 256)
So that leaves us with (30 - 5 - 8) = 17 bits for tag. As different memory blocks can be mapped to same cache line, this tag value helps in differentiating among them.
When an address is generated by the processor, 8 middle bits of the 30 bit address is used to select the cache set. There will be 4 cache lines in that set. So tags of the all four resident cache lines are checked against the tag of the generated address for a match.
Example
If a 30 bit address is 00000000000000000-00000100-00010('-' separated for clarity), then
offset within the cache is 2
set number is 4
tag is 0
In their "Computer Organization and Design, the Hardware-Software Interface", Patterson and Hennessy talk about caches. For example, in this version, page 408 shows the following image (I have added blue, red, and green lines):
Apparently, the authors use only the term "block" (and not the "line") when they describe set-associative caches. In a direct-mapped cache, the "index" part of the address addresses the line. In a set-associative, it indexes the set.
This visualization should get along well with #Soumen's explanation in the accepted answer.
However, the book mainly describes Reduced Instruction Set Architectures (RISC). I am personally aware of MIPS and RISC-V versions. So, if you have an x86 in front of you, take this picture with a grain of salt, more as a concept visualization than as actual implementation.
If we divide the memory into cache line sized chunks(i.e. 32B chunks of memory), each of this chunks is called a block. Now when you try to access some memory address, the whole memory block(size 32B) containing that address will be placed to a cache line.
No each set is not responsible for 4096KB or one particular memory page. Multiple memory blocks from different memory pages can be mapped to same cache set.

Cache Memory Blocks Organization

I am not able to understand how exactly the cache is organized in the following scenario.
The cache size is 256 bytes. The cache line size is 8 bytes. All variables are 4 bytes. Assume that an array A[1024] is stored in memory locations 0-4095. Suppose if we are using fully associative mapping technique, how is the array mapped to this particular cache ? Consider that the cache is initially empty and we use LRU algorithm for replacement. During each replacement, an entire line of cache is replaced.
Initial analysis :
There will be 32 cache blocks each with 8 bytes length. But the variables to be stored in these locations is only 4 bytes long. I am not able to take this analysis any further as to how these array elements are mapped to the 32 cache blocks.
Let's assume it's accessed sequentially:
for (int i=0; i<1024; ++i)
read(A[i]);
In that case, you'll fill the first 64 elements (A[0] through A[63]) into the 32 cache blocks in adjacent pairs like MSalters said.
The next access would have to kick out the least recently used line, which, since you access the array in sequential order is A[64]. It would have to pick a victim to kick out, and since you're using LRU that would be the first block (way 0). You therefore replace A[0] and A[1] with A[64] and A[65] and so on, so in general you'll have element i mapped into way floor(i/2)%32.
Now computing the hit rate requires an additional assumption - each memory line fetched is the size of a full block (8 bytes), since you can't fill half blocks (actually there are ways using mask bits, but let's assume the simple case). We therefore get each second element "for free" - fetching A[0] would also fetch A[1] and so on. In theory this means that the hit rate could be 50% (miss even elements, hit odds, in reality most CPUs would perform the accesses in parallel so you won't really have that hit rate, but let's say the accesses are serialized here).
Note that each new block fetched after the first 64 elements would have to evict a block from the cache, if processing the elements also modifies them you'll have to write them back too.
Elements A[0] and A[1] are stored in adjacent memory locations, 0-4 and 4-8. That means they share the first cache block. The other elements are similarly mapped pairwise to a cache line. Which pair goes where?

How does direct mapped cache work?

I am taking a System Architecture course and I have trouble understanding how a direct mapped cache works.
I have looked in several places and they explain it in a different manner which gets me even more confused.
What I cannot understand is what is the Tag and Index, and how are they selected?
The explanation from my lecture is:
"Address divided is into two parts
index (e.g 15 bits) used to address (32k) RAMs directly
Rest of address, tag is stored and compared with incoming tag. "
Where does that tag come from? It cannot be the full address of the memory location in RAM since it renders direct mapped cache useless (when compared with the fully associative cache).
Thank you very much.
Okay. So let's first understand how the CPU interacts with the cache.
There are three layers of memory (broadly speaking) - cache (generally made of SRAM chips), main memory (generally made of DRAM chips), and storage (generally magnetic, like hard disks). Whenever CPU needs any data from some particular location, it first searches the cache to see if it is there. Cache memory lies closest to the CPU in terms of memory hierarchy, hence its access time is the least (and cost is the highest), so if the data CPU is looking for can be found there, it constitutes a 'hit', and data is obtained from there for use by CPU. If it is not there, then the data has to be moved from the main memory to the cache before it can be accessed by the CPU (CPU generally interacts only with the cache), that incurs a time penalty.
So to find out whether the data is there or not in the cache, various algorithms are applied. One is this direct mapped cache method. For simplicity, let's assume a memory system where there are 10 cache memory locations available (numbered 0 to 9), and 40 main memory locations available (numbered 0 to 39). This picture sums it up:
There are 40 main memory locations available, but only upto 10 can be accommodated in the cache. So now, by some means, the incoming request from CPU needs to be redirected to a cache location. That has two problems:
How to redirect? Specifically, how to do it in a predictable way which will not change over time?
If the cache location is already filled up with some data, the incoming request from CPU has to identify whether the address from which it requires the data is same as the address whose data is stored in that location.
In our simple example, we can redirect by a simple logic. Given that we have to map 40 main memory locations numbered serially from 0 to 39 to 10 cache locations numbered 0 to 9, the cache location for a memory location n can be n%10. So 21 corresponds to 1, 37 corresponds to 7, etc. That becomes the index.
But 37, 17, 7 all correspond to 7. So to differentiate between them, comes the tag. So just like index is n%10, tag is int(n/10). So now 37, 17, 7 will have the same index 7, but different tags like 3, 1, 0, etc. That is, the mapping can be completely specified by the two data - tag and index.
So now if a request comes for address location 29, that will translate to a tag of 2 and index of 9. Index corresponds to cache location number, so cache location no. 9 will be queried to see if it contains any data, and if so, if the associated tag is 2. If yes, it's a CPU hit and the data will be fetched from that location immediately. If it is empty, or the tag is not 2, it means that it contains the data corresponding to some other memory address and not 29 (although it will have the same index, which means it contains a data from address like 9, 19, 39, etc.). So it is a CPU miss, and data from location no. 29 in main memory will have to be loaded into the cache at location 9 (and the tag changed to 2, and deleting any data which was there before), after which it will be fetched by CPU.
Lets use an example. A 64 kilobyte cache, with 16 byte cache-lines has 4096 different cache lines.
You need to break the address down into three different parts.
The lowest bits are used to tell you the byte within a cache line when you get it back, this part isn't directly used in the cache lookup. (bits 0-3 in this example)
The next bits are used to INDEX the cache. If you think of the cache as a big column of cache lines, the index bits tell you which row you need to look in for your data. (bits 4-15 in this example)
All the other bits are TAG bits. These bits are stored in the tag store for the data you have stored in the cache, and we compare the corresponding bits of the cache request to what we have stored to figure out if the data we are cacheing are the data that are being requested.
The number of bits you use for the index is log_base_2(number_of_cache_lines) [it's really the number of sets, but in a direct mapped cache, there are the same number of lines and sets]
A direct mapped cache is like a table that has rows also called cache line and at least 2 columns one for the data and the other one for the tags.
Here is how it works: A read access to the cache takes the middle part of the address that is called index and use it as the row number. The data and the tag are looked up at the same time.
Next, the tag needs to be compared with the upper part of the address to decide if the line is from the same address range in memory and is valid. At the same time, the lower part of the address can be used to select the requested data from cache line (I assume a cache line can hold data for several words).
I emphasized a little on data access and tag access+compare happens at the same time, because that is key to reduce the latency (purpose of a cache). The data path ram access doesn't need to be two steps.
The advantage is that a read is basically a simple table lookup and a compare.
But it is direct mapped that means for every read address there is exactly one place in the cache where this data could be cached. So the disadvantage is that a lot of other addresses would be mapped to the same place and may compete for this cache line.
I have found a good book at the library that has offered me the clear explanation I needed and I will now share it here in case some other student stumbles across this thread while searching about caches.
The book is "Computer Architecture - A Quantitative Approach" 3rd edition by Hennesy and Patterson, page 390.
First, keep in mind that the main memory is divided into blocks for the cache.
If we have a 64 Bytes cache and 1 GB of RAM, the RAM would be divided into 128 KB blocks (1 GB of RAM / 64B of Cache = 128 KB Block size).
From the book:
Where can a block be placed in a cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The destination block is calculated using this formula: <RAM Block Address> MOD <Number of Blocks in the Cache>
So, let's assume we have 32 blocks of RAM and 8 blocks of cache.
If we want to store block 12 from RAM to the cache, RAM block 12 would be stored into Cache block 4. Why? Because 12 / 8 = 1 remainder 4. The remainder is the destination block.
If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed anywhere in a restricted set of places in the cache, the cache is set associative.
Basically, a set is a group of blocks in the cache. A block is first mapped onto a set and then the block can be placed anywhere inside the set.
The formula is: <RAM Block Address> MOD <Number of Sets in the Cache>
So, let's assume we have 32 blocks of RAM and a cache divided into 4 sets (each set having two blocks, meaning 8 blocks in total). This way set 0 would have blocks 0 and 1, set 1 would have blocks 2 and 3, and so on...
If we want to store RAM block 12 into the cache, the RAM block would be stored in the Cache blocks 0 or 1. Why? Because 12 / 4 = 3 remainder 0. Therefore set 0 is selected and the block can be placed anywhere inside set 0 (meaning block 0 and 1).
Now I'll go back to my original problem with the addresses.
How is a block found if it is in the cache?
Each block frame in the cache has an address. Just to make it clear, a block has both address and data.
The block address is divided into multiple pieces: Tag, Index and Offset.
The tag is used to find the block inside the cache, the index only shows the set in which the block is situated (making it quite redundant) and the offset is used to select the data.
By "select the data" I mean that in a cache block there will obviously be more than one memory locations, the offset is used to select between them.
So, if you want to imagine a table, these would be the columns:
TAG | INDEX | OFFSET | DATA 1 | DATA 2 | ... | DATA N
Tag would be used to find the block, index would show in which set the block is, offset would select one of the fields to its right.
I hope that my understanding of this is correct, if it is not please let me know.

Resources