Fully associative cache offset - caching

When dealing with 32 bit addresses and fully associative cache architecture, do we take away an offset from the address when comparing it with the tag of the cache, or do we take the full 32 bit address and compare it with the tag in the cache?
I am designing a cache simulator, and want to make sure I understood this portion right.

When your dealing with 32-bit addresses you use the bits from 31 to 5 for the tag and 4 to 0 for the offset. This means you're not taking all the address to compare with the tag.
i.e. Address A[31:0] is splitter into tag A[31:5] and offset A[4:0].
For more information see https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/fully.html

Related

Where are tag bits stored in direct mapped cache?

From my understanding, direct mapped cache compares tag bits. But where are tag bits stored? Are they inside cache? If yes, are they stored inside the cache block itself and actual block size is bigger?
The cache tag bits are the bits within an address (from the perspective of the CPU) that are used as a tag based on the size and width of the cache.
Let us assume a very simple cache with 8 64 byte lines
the 6 least significant bits represent a location within a 64 byte line. The next 3 bits would be the tag, since we only have 8 lines
bits in address:
... xxxx xxxt ttxx xxxx
Addresses 0x86 and 0x10080 would have the same tag in this example
This is an oversimplified example, and there are many nuances to caches, so I would recommend reading some more in depth material on the topic, or read about an actual implementation (i.e. a CPU manual) to get a much better feel for how this works

Cache memory design - address decoding

I was given the following problem:
A CPU generates 32 bit addresses for a byte addressable memory. Design an 8 KB cache memory for this CPU (8 KB is the cache size only for the data; it does not include the tag). The block size is 32 bytes. Show the block diagram, and the address decoding for direct mapped cache memory.
I determined that:
8 bits are needed for indexing
5 bits are needed for block offset
19 bits for the tag.
Is my solution correct? How should I do the decoding?
The numbers seem correct, however it is always worth pointing out in your solution that you're taking cache associativity under account. Specifically, 32-8-5=19 is only valid when the cache is directly mapped.
The decoding part is nicely illustrated in your drawing – it's simply the act of taking 32 bits of the address as used by the CPU apart into the tag, index, and offset fields.

Virtually indexed physically tagged cache Synonym

I am not able to entirely grasp the concept of synonyms or aliasing in VIPT caches.
Consider the address split as:-
Here, suppose we have 2 pages with different VA's mapped to same physical address(or frame no).
The pageno part of VA (bits 13-39) which are different gets translated to PFN of PA(bits 12-35) and the PFN remains same for both the VA's as they are mapped to same physical frame.
Now the pageoffset part(bits 0-13) of both the VA's are same as the data which they want to access from a particular frame no is same.
As the pageoffset part of both VA's are same, bits (5-13) will also be same, so the index or set no is the same and hence there should be no aliasing as only single set or index no is mapped to a physical frame no.
How is bit 12 as shown in the diagram, responsible for aliasing ? I am not able to understand that.
It would be great if someone could give an example with the help of addresses.
Note: this diagram has a minor error that doesn't affect the question: 36 - 12 = 24-bit tags for 36-bit physical addresses, not 28. MIPS64 R4x00 CPUs do in fact have 40-bit virtual, 36-bit physical addresses, and 24-bit tags, according to chapters 4 and 11 of the manual.
This diagram is from http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html which does label it as being for MIPS R4x00.
The page offset is bits 0-11, not 0-13. Look at your bottom diagram: the page offset is the low 12 bits, so you have 4k pages (like x86 and other common architectures).
If any of the index bits come from above the page offset, VIPT no longer behaves like a PIPT with free translation for the index bits. That's the case here.
A process can have the same physical page (frame) mapped to 2 different virtual pages.
Your claim that The pageno part of VA (bits 13-39) which are different gets translated to PFN of PA(bits 12-35) and the PFN remains same for both the VA's is totally bogus. Translation can change bit #12. So one of the index bits really is virtual and not also physical, so two entries for the same physical line can go in different sets.
I think my main confusion is regarding the page offset range. Is it the same for both PA and VA (that is 0-11) or is it 0-12 for VA and 0-11 for PA? Will they always be same?
It's always the same for PA and VA. The page offset isn't marked on the VA part of your diagram, only the range of bits used as the index.
It wouldn't make sense for it to be any different: virtual and physical memory are both byte-addressable (or word-addressable). And of course a page frame (physical page) is the same size as a virtual page. Right or left shifting an address during translation from virtual to physical would make no sense.
As discussed in comments:
I did eventually find http://www.cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html (which includes the diagram in the question!). It says the same thing: physical tagging does solve the cache homonym problem as an alternative to flushing on context switch.
But not the synonym problem. For that, you can have the OS ensure that bit 12 of every VA = bit 12 of every PA. This is called page coloring.
Page coloring would also solve the homonym problem without the hardware doing overlapping tag bits, because it gives 1 more bit that's the same between physical and virtual address. phys idx = virt idx. (But then the HW would be relying on software to be correct, if it wanted to depend on this invariant.)
Another reason for having the tag overlap the index is write-back during eviction:
Outer caches are almost always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out the memory hierarchy.
A write-back cache needs to be able to evict dirty lines (send them to L2 or to physical RAM) long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere. How does the VIPT to PIPT conversion work on L1->L2 eviction
Having the tag include all the physical address bits above the page offset solves this problem: given the page-offset index bits and the tag, you can construct the full physical address.
(Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, there is no write-back; eviction = drop.)

How do I compute tag of full associative mapped cache?

I have a full associative mapping cache memory with globally 2^10 lines.
Each line is hosting 16 data.
The address bus a 32 bits.
How many bits are necessary for the TAG?
I think I have to make bus - the offset is it correct? how do I compute the offset?
The correct way is by subtracting to the adress bus the logarithm ofthe block. In this case so is 32-4=28.

Will _mm512_mask_prefetch_i32gather_ps() prefetch an entire cache line for each element?

The gather prefetch intrinsic _mm512_mask_prefetch_i32gather_ps can be used to prefetch 32 bit floats on Knights Corner.
Since a corresponding intrinsic for doubles does not exist, how should this intrinsic be used for prefetch 64 or 128 bit elements?
Does each 4 byte chunk needed to be explicitly prefetched, or can we assume that each prefetch of a 32 bit variable will actually prefetch the entire 64 byte cache line that it occupies?
Example:
I want to prefetch 4 doubles at offsets {1,2,10,12} from base address 0xf0000000.
This corresponds to addresses of {0xf0000008, 0xf0000010, 0xf0000050, 0xf0000060}.
These occupy two cache lines starting at {0xf0000000, 0xf0000040}.
Would it be sufficient to use _mm512_mask_prefetch_i32gather_ps with the base addresses of these two cache lines?
I originally posted this question on the Intel MIC forum without success.

Resources