Virtually addressed Cache - caching

Relation between cache size and page size
How does the associativity and page size constrain the Cache size in virtually addressed cache architecture?
Particularly I am looking for an example on the following statement:
If C≤(page_size x associativity), the cache index bits come only
from page offset (same in Virtual address and Physical address).

Intel CPUs have used 8-way associative 32kiB L1D with 64B lines for many years, for exactly this reason. Pages are 4k, so the page offset is 12 bits, exactly the same number of bits that make up the index and offset within a cache line.
See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer for more details about how it lets the cache avoid aliasing problems like a PIPT cache, but still be as fast as a VIPT cache.
The idea is that the virtual address bits below the page offset are already physical address bits. So a VIPT cache that works this way is more like a PIPT cache with free translation of the index bits.

Related

Cache memory design - address decoding

I was given the following problem:
A CPU generates 32 bit addresses for a byte addressable memory. Design an 8 KB cache memory for this CPU (8 KB is the cache size only for the data; it does not include the tag). The block size is 32 bytes. Show the block diagram, and the address decoding for direct mapped cache memory.
I determined that:
8 bits are needed for indexing
5 bits are needed for block offset
19 bits for the tag.
Is my solution correct? How should I do the decoding?
The numbers seem correct, however it is always worth pointing out in your solution that you're taking cache associativity under account. Specifically, 32-8-5=19 is only valid when the cache is directly mapped.
The decoding part is nicely illustrated in your drawing – it's simply the act of taking 32 bits of the address as used by the CPU apart into the tag, index, and offset fields.

Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical

This question comes in context of a section on virtual memory in an undergraduate computer architecture course. Neither the teaching assistants nor the professor were able to answer it sufficiently, and online resources are limited.
Question:
Suppose a processor with the following specifications:
8KB pages
32-bit virtual addresses
28-bit physical addresses
a two-level page table, with a 1KB page table at the first level, and 8KB page tables at the
second level
4-byte page table entries
a 16-entry 8-way set associative TLB
in addition to the physical frame (page) number, page table entries contain a valid bit, a
readable bit, a writeable bit, an executable bit, and a kernel-only bit.
Now suppose this processor has a 32KB L1 cache whose tags are computed based on physical addresses. What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
Intuition:
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page, thus providing a small speed-up. However, I am unsure if this is the correct intuition and definitely don't know how to follow through with it. Could someone please explain this?
Note: I have computed the number of page table entries to be 2^19, if that helps anyone.
What is the minimum associativity that cache must have to allow the appropriate cache set to be accessed before computing the physical address that corresponds to a virtual address?
They're only specified that the cache is physically tagged.
You can always build a virtually indexed cache, no minimum associativity. Even direct-mapped (1 way per set) works. See Cache Addressing Methods Confusion for details on VIPT vs. PIPT (and VIVT, and even the unusual PIVT).
For this question not to be trivial, I assume they also meant "without creating aliasing problems", so VIPT is just a speedup over PIPT (physically indexed, phyiscally tagged). You get the benefit of allowing TLB lookup in parallel with fetching tags (and data) for the ways of the indexed set without any downsides.
My intuition is that if the number of indices in the cache and the number of virtual pages (aka page table entries) is evenly divisible by each other, then we could retrieve the bytes contained within the physical page directly from the cache without ever computing that physical page
You need the physical address to check against the tags; remember your cache is physically tagged. (Virtually tagged caches do exist, but typically have to get flushed on context switches to a process with different page tables = different virtual address space. This used to be used for small L1 caches on old CPUs.)
Having both numbers be a power of 2 is normally assumed, so they're always evenly divisible.
Page sizes are always a power of 2 so you can split an address into page number and offset-within-page by just taking different ranges of bits in the address.
Small/fast cache sizes also always have a power of 2 number of sets so the index "function" is just taking a range of bits from the address. For a virtually-indexed cache: from the virtual address. For a physically-indexed cache: from the physical address. (Outer caches like a big shared L3 cache may have a fancier indexing function, like a hash of more address bits, to avoid aliasing for addresses offset from each other by a large power of 2.)
The cache size might not be a power of 2, but you'd do that by having a non-power-of-2 associativity (e.g. 10 or 12 ways is not rare) rather than a non-power-of-2 line size or number of sets. After indexing a set, the cache fetches the tags for all the ways of that set and compare them in parallel. (And for fast L1 caches, often fetch the data selected by the line-offset bits in parallel, too, then the comparators just mux that data into the output, or raise a flag for no match.)
Requirements for VIPT without aliasing (like PIPT)
For that case, you need all index bits to come from below the page offset. They translate "for free" from virtual to physical so a VIPT cache (that indexes a set before TLB lookup) has no homonym/synonym problems. Other than performance, it's PIPT.
My detailed answer on Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? includes a section on that speed hack.
Virtually indexed physically tagged cache Synonym shows a case where the cache does not have that property, and needs page coloring by the OS to let avoid synonym problems.
How to compute cache bit widths for tags, indices and offsets in a set-associative cache and TLB has some more notes about cache size / associativity that give that property.
Formula:
min associativity = cache size / page size
e.g. a system with 8kiB pages needs a 32kiB L1 cache to be at least 4-way associative so that index bits only come from the low 13.
A direct-mapped cache (1 way per set) can only be as large as 1 page: byte-within-line and index bits total up to the byte-within-page offset. Every byte within a direct-mapped (1-way) cache must have a unique index:offset address, and those bits come from contiguous low bits of the full address.
To put it another way, 2^(idx_bits + within_line_bits) is the total cache size with only one way per set. 2^N is the page size, for a page offset of N (the number of byte-within-page address bits that translate for free).
The actual number of sets (in this case = lines) depends on the line size and page size. Using smaller / larger lines would just shift the divide between offset and index bits.
From there, the only way to make the cache bigger without indexing from higher address bits is to add more ways per set, not more ways.

Virtually indexed physically tagged cache Synonym

I am not able to entirely grasp the concept of synonyms or aliasing in VIPT caches.
Consider the address split as:-
Here, suppose we have 2 pages with different VA's mapped to same physical address(or frame no).
The pageno part of VA (bits 13-39) which are different gets translated to PFN of PA(bits 12-35) and the PFN remains same for both the VA's as they are mapped to same physical frame.
Now the pageoffset part(bits 0-13) of both the VA's are same as the data which they want to access from a particular frame no is same.
As the pageoffset part of both VA's are same, bits (5-13) will also be same, so the index or set no is the same and hence there should be no aliasing as only single set or index no is mapped to a physical frame no.
How is bit 12 as shown in the diagram, responsible for aliasing ? I am not able to understand that.
It would be great if someone could give an example with the help of addresses.
Note: this diagram has a minor error that doesn't affect the question: 36 - 12 = 24-bit tags for 36-bit physical addresses, not 28. MIPS64 R4x00 CPUs do in fact have 40-bit virtual, 36-bit physical addresses, and 24-bit tags, according to chapters 4 and 11 of the manual.
This diagram is from http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html which does label it as being for MIPS R4x00.
The page offset is bits 0-11, not 0-13. Look at your bottom diagram: the page offset is the low 12 bits, so you have 4k pages (like x86 and other common architectures).
If any of the index bits come from above the page offset, VIPT no longer behaves like a PIPT with free translation for the index bits. That's the case here.
A process can have the same physical page (frame) mapped to 2 different virtual pages.
Your claim that The pageno part of VA (bits 13-39) which are different gets translated to PFN of PA(bits 12-35) and the PFN remains same for both the VA's is totally bogus. Translation can change bit #12. So one of the index bits really is virtual and not also physical, so two entries for the same physical line can go in different sets.
I think my main confusion is regarding the page offset range. Is it the same for both PA and VA (that is 0-11) or is it 0-12 for VA and 0-11 for PA? Will they always be same?
It's always the same for PA and VA. The page offset isn't marked on the VA part of your diagram, only the range of bits used as the index.
It wouldn't make sense for it to be any different: virtual and physical memory are both byte-addressable (or word-addressable). And of course a page frame (physical page) is the same size as a virtual page. Right or left shifting an address during translation from virtual to physical would make no sense.
As discussed in comments:
I did eventually find http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html (which includes the diagram in the question!). It says the same thing: physical tagging does solve the cache homonym problem as an alternative to flushing on context switch.
But not the synonym problem. For that, you can have the OS ensure that bit 12 of every VA = bit 12 of every PA. This is called page coloring.
Page coloring would also solve the homonym problem without the hardware doing overlapping tag bits, because it gives 1 more bit that's the same between physical and virtual address. phys idx = virt idx. (But then the HW would be relying on software to be correct, if it wanted to depend on this invariant.)
Another reason for having the tag overlap the index is write-back during eviction:
Outer caches are almost always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out the memory hierarchy.
A write-back cache needs to be able to evict dirty lines (send them to L2 or to physical RAM) long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere. How does the VIPT to PIPT conversion work on L1->L2 eviction
Having the tag include all the physical address bits above the page offset solves this problem: given the page-offset index bits and the tag, you can construct the full physical address.
(Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, there is no write-back; eviction = drop.)

How does lookup the L1 and L2 cache?

Recently I was reading some material on cpu cache. I am wondering how does the cpu lookup the L1 and L2 cache and in what format is the data in the cpu cache stored?
I think a linear scan of the cache would be inefficient, are there any better solutions?
Thanks.
It uses index bits and tags extracted from the address it is looking up.
Say you are accessing some 32 bit address ADDR
ADDR will have bits: 31--------------------------0, [------tag|index|offset]
Then depending on the size of your cache:
Let's say you have a 32K, Direct Mapped cache with 32bytes per block.
Offset bits are used to find the data within each line because 8bytes is a minimum data size to be brought into the cache (well you always get the full 32bytes, but within the 32bytes you will have your data.)
This accounts for a cache with 1024 lines or sets, again each line with 32bytes. In order to index the 1024 sets you need 10bits. Thus the 10 bits from your address are used as an index into the cache. The offset bits are used to see where inside that line your data is , and the tag bits are used to match the address that you are looking up since two or more addresses will map into the same line of the cache.
Makes sense?
I do not know your answer, but I can recommend a good book that might lead you to one - The Essentials Of Computer Organization and Architecture

In what circumstances can large pages produce a speedup?

Modern x86 CPUs have the ability to support larger page sizes than the legacy 4K (ie 2MB or 4MB), and there are OS facilities (Linux, Windows) to access this functionality.
The Microsoft link above states large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory". Which isn't very helpful in predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in some performance improvement. Anyone got any success stories ?
There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably as the number of TLB records needing copying is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.
The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.
For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:
L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries
While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:
L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: none
Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.
What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:
5 trips to memory to load data mapped on a 4kB page,
4 trips to memory to load data mapped on a 2MB page, and
3 trips to memory to load data mapped on a 1GB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information.
The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593) http://support.amd.com/us/Embedded_TechDocs/24593.pdf
Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.
I tried to contrive some code which would maximise thrashing of the TLB with 4k pages in order to examine the gains possible from large pages. The stuff below runs 2.6 times faster (than 4K pages) when 2MByte pages are are provided by libhugetlbfs's malloc (Intel i7, 64bit Debian Lenny ); hopefully obvious what scoped_timer and random0n do.
volatile char force_result;
const size_t mb=512;
const size_t stride=4096;
std::vector<char> src(mb<<20,0xff);
std::vector<size_t> idx;
for (size_t i=0;i<src.size();i+=stride) idx.push_back(i);
random0n r0n(/*seed=*/23);
std::random_shuffle(idx.begin(),idx.end(),r0n);
{
scoped_timer t
("TLB thrash random",mb/static_cast<float>(stride),"MegaAccess");
char hash=0;
for (size_t i=0;i<idx.size();++i)
hash=(hash^src[idx[i]]);
force_result=hash;
}
A simpler "straight line" version with just hash=hash^src[i] only gained 16% from large pages, but (wild speculation) Intel's fancy prefetching hardware may be helping the 4K case when accesses are predictable (I suppose I could disable prefetching to investigate whether that's true).
I've seen improvement in some HPC/Grid scenarios - specifically physics packages with very, very large models on machines with lots and lots of RAM. Also the process running the model was the only thing active on the machine. I suspect, though have not measured, that certain DB functions (e.g. bulk imports) would benefit as well.
Personally, I think that unless you have a very well profiled/understood memory access profile and it does a lot of large memory access, it is unlikely that you will see any significant improvement.
This is getting esoteric, but Huge TLB pages make a significant difference on the Intel Xeon Phi (MIC) architecture when doing DMA memory transfers (from Host to Phi via PCIe). This Intel link describes how to enable huge pages. I found increasing DMA transfer sizes beyond 8 MB with normal TLB page size (4K) started to decrease performance, from about 3 GB/s to under 1 GB/s once the transfer size hit 512 MB.
After enabling huge TLB pages (2MB), the data rate continued to increase to over 5 GB/s for DMA transfers of 512 MB.
I get a ~5% speedup on servers with a lot of memory (>=64GB) running big processes.
e.g. for a 16GB java process, that's 4M x 4kB pages but only 4k x 4MB pages.

Resources