When is it not possible to exploit spatial locality in cache?

When is it not possible to exploit spatial locality in cache? - caching

We are given a processor whose instructions operate on 8 - byte operands and whose instructions are also encoded using 8 bytes. We are using a 16 kilo-byte,
4-way set associative cache that contains 1024 sets. The cache has 4 * 1024 = 4096 cache lines in total. That means, each cache line is 16KB=4096 = 4 bytes, so each operand and each instruction needs to be stored in two cache lines, which will require 2 accesses to the cache for each load/store operation and instruction fetches. We are told that the cache cannot apply spatial locality, but why? What does spatial locality mean in this context?

So your machine doesn't have single byte loads/stores at all?
If every load is a full cache line (or multiple whole cache lines), then bringing a line into cache for one load will never benefit another load that's spatially nearby.
(Unless you have hardware prefetching to detect sequential accesses and start fetching adjacent cache lines...)

Related

Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical

This question comes in context of a section on virtual memory in an undergraduate computer architecture course. Neither the teaching assistants nor the professor were able to answer it sufficiently, and online resources are limited.
Question:
Suppose a processor with the following specifications:
8KB pages
32-bit virtual addresses
28-bit physical addresses
a two-level page table, with a 1KB page table at the first level, and 8KB page tables at the
second level
4-byte page table entries
a 16-entry 8-way set associative TLB
in addition to the physical frame (page) number, page table entries contain a valid bit, a
readable bit, a writeable bit, an executable bit, and a kernel-only bit.
Now suppose this processor has a 32KB L1 cache whose tags are computed based on physical addresses. What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
Intuition:
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page, thus providing a small speed-up. However, I am unsure if this is the correct intuition and definitely don't know how to follow through with it. Could someone please explain this?
Note: I have computed the number of page table entries to be 2^19, if that helps anyone.

What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
They're only specified that the cache is physically tagged.
You can always build a virtually indexed cache, no minimum associativity. Even direct-mapped (1 way per set) works. See Cache Addressing Methods Confusion for details on VIPT vs. PIPT (and VIVT, and even the unusual PIVT).
For this question not to be trivial, I assume they also meant "without creating aliasing problems", so VIPT is just a speedup over PIPT (physically indexed, phyiscally tagged). You get the benefit of allowing TLB lookup in parallel with fetching tags (and data) for the ways of the indexed set without any downsides.
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page
You need the physical address to check against the tags; remember your cache is physically tagged. (Virtually tagged caches do exist, but typically have to get flushed on context switches to a process with different page tables = different virtual address space. This used to be used for small L1 caches on old CPUs.)
Having both numbers be a power of 2 is normally assumed, so they're always evenly divisible.
Page sizes are always a power of 2 so you can split an address into page number and offset-within-page by just taking different ranges of bits in the address.
Small/fast cache sizes also always have a power of 2 number of sets so the index "function" is just taking a range of bits from the address. For a virtually-indexed cache: from the virtual address. For a physically-indexed cache: from the physical address. (Outer caches like a big shared L3 cache may have a fancier indexing function, like a hash of more address bits, to avoid aliasing for addresses offset from each other by a large power of 2.)
The cache size might not be a power of 2, but you'd do that by having a non-power-of-2 associativity (e.g. 10 or 12 ways is not rare) rather than a non-power-of-2 line size or number of sets. After indexing a set, the cache fetches the tags for all the ways of that set and compare them in parallel. (And for fast L1 caches, often fetch the data selected by the line-offset bits in parallel, too, then the comparators just mux that data into the output, or raise a flag for no match.)
Requirements for VIPT without aliasing (like PIPT)
For that case, you need all index bits to come from below the page offset. They translate "for free" from virtual to physical so a VIPT cache (that indexes a set before TLB lookup) has no homonym/synonym problems. Other than performance, it's PIPT.
My detailed answer on Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? includes a section on that speed hack.
Virtually indexed physically tagged cache Synonym shows a case where the cache does not have that property, and needs page coloring by the OS to let avoid synonym problems.
How to compute cache bit widths for tags, indices and offsets in a set-associative cache and TLB has some more notes about cache size / associativity that give that property.
Formula:
min associativity = cache size / page size
e.g. a system with 8kiB pages needs a 32kiB L1 cache to be at least 4-way associative so that index bits only come from the low 13.
A direct-mapped cache (1 way per set) can only be as large as 1 page: byte-within-line and index bits total up to the byte-within-page offset. Every byte within a direct-mapped (1-way) cache must have a unique index:offset address, and those bits come from contiguous low bits of the full address.
To put it another way, 2^(idx_bits + within_line_bits) is the total cache size with only one way per set. 2^N is the page size, for a page offset of N (the number of byte-within-page address bits that translate for free).
The actual number of sets (in this case = lines) depends on the line size and page size. Using smaller / larger lines would just shift the divide between offset and index bits.
From there, the only way to make the cache bigger without indexing from higher address bits is to add more ways per set, not more ways.

Understanding Direct Mapped Cache

I'm trying to understand direct mapped cache, but it is a very complex concept. I have written what I think I understand so far, but I am unsure whether I am correct or not. Can somebody please verify if the explanation below is correct?
E.g, for a made up computer, just for the sake of this question, there 1024 memory locations (cells) in the RAM. This equals 2^10 so the address for each of these memory locations must be 10 bits long.
The CPU is asked to get data from the RAM memory address 1100100111. However the CPU doesn't access the data directly from this memory address in the RAM. The RAM stores this data to cache memory and then the CPU gets the data from the cache memory.
There are different ways of doing this, one being direct mapped cache. The cache memory and ram memory are divided up into blocks, where the number of cells in the blocks in each memory must be the same. The number of blocks in the RAM and cache must also be a power of 2.
In this example lets say there are 2^6 = 64 blocks in the RAM, so there are 1024/64 = 16 cells in each block. Lets say there are 2^2 = 4 blocks in the cache, so the cache has 64 cells. The "6" and "2" in the exponents of these numbers are important later on.
Because the The number of blocks in the RAM and cache is a power of 2, it makes the calculations easy. In our address 1100100111 the last 6 bits mark the offset 100111 (the 6 comes from the fact that 2^6 = 64), and the remaining 4 bits 1100 mark the RAM block number the data is stored in. Within this block number are two other important numbers. First the cache block number; this is the cache block that that RAM block would store to. This is the first 2 bits after the offset, so it will be 00 (The 2 comes from the fact that There are 2^2 = 4 blocks in the cache). The remaining 2 numbers in the address mark the tag. This will be 11.
So when the CPU is asked to get data from memory address 1100100111 it will look for this data in cache block number 00. It will compare the tag of the address 11 to the tag saved in the cache, which is a separate piece of memory used to store information about where from the RAM the data has come from. If the tags are the same this is a hit and this is the data the CPU is looking for. If the tag of the address and the tag in the memory are different, then this is a miss, and the data isn't stored in the cache.
If this is the case, the cache controller will get the data from block number 1100 in the RAM and store it in the cache block number 00, and update the tag in this block to 11. The CPU can now get the data in this block.
Is this all correct? I need to understand this before I can start to try and understand associative and set associative memory.
Thanks!

You have the right idea, but your numbers went wrong somewhere. In your example you have a direct-mapped cache of 4 blocks/lines of 16 bytes/cells each. The address 1100100111 will be divided up as follows. You use the least significant four bits 0111 as the offset because it refers to which cell of a particular block you want. I think you accidentally included the block number as part of the offset. Anyway, the next least significant two bits 10 will be the block number and the most significant four bits 1100 will be the tag.
Your understanding seems to be fine. One thing more that is necessary is a bit to indicate if the cache block is valid or not. Good luck with the associative stuff!

How much data is loaded in to the L2 and L3 caches?

If I have this class:
class MyClass{
short a;
short b;
short c;
};
and I have this code performing calculations on the above:
std::vector<MyClass> vec;
//
for(auto x : vec){
sum = vec.a * (3 + vec.b) / vec.c;
}
I understand the CPU only loads the very data it needs from the L1 cache, but when the L1 cache retrieves data from the L2 cache it loads a whole "cache line" (which could include a few bytes of data it doesn't need).
How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?

L2 and L3 caches also have cache lines that are smaller than a virtual memory system page. The size of L2 and L3 cache lines is greater than or equal to the L1 cache line size, not uncommonly being twice that of the L1 cache line size.
For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)
IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)
Using a larger cache line size in last level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing other levels of cache and cache coherence to use smaller chunks which reduces bandwidth use and capacity waste. (Large last level cache blocks also provide a prefetching effect whose cache polluting issues are less severe because of the relatively high capacity of last level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.) With a smaller cache (e.g., typical L1 cache), evictions happen more frequently so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.

It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 based microarchitectures use 64 byte lines in all levels of the cache hierarchy.
Typically signed shorts will require two bytes each meaning that MyClass will need 6 bytes in addition the class overhead. If your C++ implementation stores the vector<> contiguously like an array you should get about 10 MyClass objects per 64-byte lines. Provided the vector<> is the right length, you won't load much garbage.
It's wise to note that since you're accessing the elements in a very predictable pattern the hardware prefetcher should kick in and fetch a reasonable amount of data it expects to use in the future. This could potentially bring more than you need into various levels of the cache hierarchy. It will vary from chip to chip.

Difference between cache way and cache set

I am trying to learn some stuff about caches. Lets say I have a 4 way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4 cache lines.
How many cache ways do I have? I am not even sure what a cache way is. Can someone explain that? I have done some searching, the best example was
http://download.intel.com/design/intarch/papers/cache6.pdf
But I am still confused.
Thanks.

The cache you are referring to is known as set associative cache. The whole cache is divided into sets and each set contains 4 cache lines(hence 4 way cache). So the relationship stands like this :
cache size = number of sets in cache * number of cache lines in each set * cache line size
Your cache size is 32KB, it is 4 way and cache line size is 32B. So the number of sets is
(32KB / (4 * 32B)) = 256
If we think of the main memory as consisting of cache lines, then each memory region of one cache line size is called a block. So each block of main memory will be mapped to a cache line (but not always to a particular cache line, as it is set associative cache).
In set associative cache, each memory block will be mapped to a fixed set in the cache. But it can be stored in any of the cache lines of the set. In your example, each memory block can be stored in any of the 4 cache lines of a set.
Memory block to cache line mapping
Number of blocks in main memory = (1GB / 32B) = 2^25
Number of blocks in each page = (4KB / 32B) = 128
Each byte address in the system can be divided into 3 parts:
Rightmost bits represent byte offset within a cache line or block
Middle bits represent to which cache set this byte(or cache line) will be mapped
Leftmost bits represent tag value
Bits needed to represent 1GB of memory = 30 (1GB = (2^30)B)
Bits needed to represent offset in cache line = 5 (32B = (2^5)B)
Bits needed to represent 256 cache sets = 8 (2^8 = 256)
So that leaves us with (30 - 5 - 8) = 17 bits for tag. As different memory blocks can be mapped to same cache line, this tag value helps in differentiating among them.
When an address is generated by the processor, 8 middle bits of the 30 bit address is used to select the cache set. There will be 4 cache lines in that set. So tags of the all four resident cache lines are checked against the tag of the generated address for a match.
Example
If a 30 bit address is 00000000000000000-00000100-00010('-' separated for clarity), then
offset within the cache is 2
set number is 4
tag is 0

In their "Computer Organization and Design, the Hardware-Software Interface", Patterson and Hennessy talk about caches. For example, in this version, page 408 shows the following image (I have added blue, red, and green lines):
Apparently, the authors use only the term "block" (and not the "line") when they describe set-associative caches. In a direct-mapped cache, the "index" part of the address addresses the line. In a set-associative, it indexes the set.
This visualization should get along well with #Soumen's explanation in the accepted answer.
However, the book mainly describes Reduced Instruction Set Architectures (RISC). I am personally aware of MIPS and RISC-V versions. So, if you have an x86 in front of you, take this picture with a grain of salt, more as a concept visualization than as actual implementation.

If we divide the memory into cache line sized chunks(i.e. 32B chunks of memory), each of this chunks is called a block. Now when you try to access some memory address, the whole memory block(size 32B) containing that address will be placed to a cache line.
No each set is not responsible for 4096KB or one particular memory page. Multiple memory blocks from different memory pages can be mapped to same cache set.

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?

Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such pitfall: suppose a data structure is mis-aligned and requires the CPU/Memory Controller to perform 2 bus transactions (instead of 1) in order to fetch the said structure, the performance is thus consequently lower.

the CPU fetches data from memory in groups of 4 bytes (it actualy depends on the hardware its 8 or other values for some types of hardware, but lets stick with 4 to keep it simple),
all is well if the data begins in an address which is dividable by 4, the CPU goes to the memory address and loads the data.
now suppose the data begins in an address not dividable by 4 say for the sake of simplicity at address 1, the CPU must take data from address 0 and then apply some algorithm to dump the byte at the 0 address , to gain access to the actual data at byte 1. this takes time and therefore lowers preformance. so it is much more efficient to have all data addresses aligned.

A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32bit data fields together in a struct so that padding overhead is reduced or alternatively review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).

In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio