This question already has answers here:
What does a 'Split' cache means. And how is it useful(if it is)?
(1 answer)
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
(7 answers)
Closed 2 years ago.
I am undefeated in a question like this
In the context of a memory hierarchy why implement data cache and
instruction cache?
I replied that it is useful to decrease the number of conflict miss and insufficient space miss. But the data cache and the instruction cache can be sized according to the number of data and instruction? Because i assumed that the number of data is higher than the number of instruction (many times we need 2 data to execute 1 instruction) and the data cache and instruction cache is sized according to this numbers. Is true or completely wrong? In the case that it's wrong, why implement data cache and instruction cache to reduce miss?
The idea of a cache is to deliver cached data in 1 cycle to keep the CPU running at maximum speed.
Now today all CPUs are pipelined. This means the they have independent modules that e.g. fetch an instruction, decode it, fetch the operands, execute the instruction, and write back the result. All of these pipeline stages are executed whenever possible at the same time for different instructions.
For maximum speed, an instruction fetch has to be done at the same time as an operand fetch of an earlier instruction decoded before. Both can only be done (in the optimal case) at the same time in 1 cycle if one has an instruction cache and a data cache.
Another possible reason to have two caches (Instruction and Data) is Thrashing. Imagine a situation where your instruction and data reside in two memory locations whose index bits are the same. Assuming a direct mapped cache, (cheeky I know), It goes like this.
Fetch instruction from memory, calculate the index and store store it there.
Decode the instruction and get the address of data.
Now fetch the data from memory, calculate the index to store data.
There is some data in that location, well too bad, flush it to next level cache and store the newly fetched data.
Execute the instruction.
Its time to decode the next instruction, well its a cache miss as we swapped the cache entry for our data. Now go fetch it again.
When we fetch it we have to replace our data again as it has the same index.
So we will be continually swapping data and instruction form the same cache line, aka thrashing.
Related
This question already has answers here:
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
(3 answers)
Closed 3 years ago.
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536) -- explanation how cache associativity impacts performance. In a test author used 3 square arrays 1023x1023, 1024x1024, 1025x1025 of ints and observed that accessing first column was slower for 1024x1024 case.
Author explained (background info, CPU is Intel with L1 cache with 32KB memory, it is 8-way associative):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
It strikes me as odd, after reading wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can be potentially mapped into the same cache line, it is conflict after conflict, and CPU has to resolve those -- it takes time.
Thus my question, what is the real nature of performance problem here. Accessible memory size of cache is lower, or entire cache is available but CPU spends more time in resolving conflicts with mapping. Or there is some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column all words requested from RAM land in the same cell in cache (or the same 2 cells, if using a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of the row are in consecutive words of RAM; when one row is finished, the next row starts. With a little bit of thought, you can see that consecutive cells in a column have addresses differing by 1024*N, when N=4 or 8 (or whatever the size of a cell). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like now many bits are needed, hence the size of the cache, 2-way, etc, etc.)
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 bytes and positioned at some multiple of that in the reall address space. A "word" for caching depends on vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words probably can be found today. Again, the power-of-2 and positioned-at-multiple... are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
I just want to clarify the concept and could find detail enough answers which can throw some light upon how everything actually works out in the hardware. Please provide any relevant details.
In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.
From the TLB we get the traslated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the translated TLB address is matched with the list of tags to find a candidate.
My question is where is this check performed ?
In Cache ?
If not in Cache, where else ?
If the check is performed in Cache, then
is there a side-band connection from TLB to the Cache module to get the
translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on "actually" how this is generally implemented and the connection between Cache module & the TLB(MMU) module ?
I know this dependents on the specific architecture and implementation.
But, what is the implementation which you know when there is VIPT cache ?
Thanks.
At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical. Or translation is a no-op. This VIPT speed with the non-aliasing of a PIPT cache works as long as L1_size / associativity <= page_size. e.g. 32kiB / 8-way = 4k pages.
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely not)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
In general desktops have 2 kinds of CPU cache to faster memory access.
1) Instruction cache -> to speed up executable instructions.
2) Data cache -> to speed up data fetch and store.
As per my understanding, Instruction cache operates on code segment of a program and Data cache operates on data segment of program. is this right?
Is there no cache advantage for memory allocated from heap? is heap memory access is covered in data cache?
Instruction cache operates on code segment of a program and Data cache operates on data segment of program. Is this right?
No, CPU is unaware about segments.
Instruction cache is for all execution accesses, whether they are performed inside code segment, or in the heap as dynamically created code.
Data cache is for all other, non-execution accesses. Data can be in the data segment, heap, or even in the code segment as constants.
As per my understanding, Instruction cache operates on code segment of a program and Data cache operates on data segment of program. is this right?
Is there no cache advantage for memory allocated from heap? is heap memory access is covered in data cache?
Memory is memory. The CPU can't tell the difference between the heap and the data.
Instruction caches usually just start with the address in the program counter and grab the next N bytes. The CPU still can't tell if its a code segment or a data segment.
When you write a program, this gets translated into machine readable binary. When CPU executes instructions, it fetches this binary, decodes, what it mean and then execute. Basically this binary tells CPU what instructions it has to execute. If this binary was only stored in main memory, then during each fetch stage, CPU has to access main memory, which is really bad. Instead what we do is to store, some of it in a cache, closer to the CPU. Since this cache only contain binary information related to the instructions to be executed, we call it instruction cache. Now instructions need data to operate. In your high level code you might have something like
arrayA[i] = (arrayB[i] + arrayC[i]) which will translate into a machine instruction something similar to
ADD memLocationStoredInRegisterA, memLocationStoredInRegisterB, memLocationStoredInRegisterC
This instruction is stored in instruction cache, but the data, i.e arrayA, arrayB and arrayC will be stored in another portion of memory. Again it will be useless to access main memory, each time this instruction is executed. Therefore we store some of this in another cache, which we call data cache.
I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's Paper for the interested. This is how I understood it.
A write request is sent from cpu to cache.
Request results in a cache-miss.
A cache block is allocated for this request in cache.(Write-Allocate)
Write request block is fetched from lower memory to the allocated cache block.(Fetch-on-Write)
Now we are able to write onto allocated and updated by fetch cache block.
Question is what happens between step 4 and step 5. (Lets say Cache is a non-blocking cache using Miss Status Handling Registers.)
Does CPU have to retry write request on cache until write-hit happens? (after fetching the block to the allocated cache block)
If not, where does write request data is being held in the meantime?
Edit: I think I've found my answer in Implementation of Write Allocate in the K86™ Processors . It is directly being written into the allocated cache block and it gets merged with the read request later on.
It is directly being written into the allocated cache block and it gets merged with the read request later on.
No, that's not what AMD's pdf says. They say the store-data is merged with the just-fetched data from memory and then stored into the L1 cache's data array.
Cache tracks validity with cache-line granularity. There's no way for it to store the fact that "bytes 3 to 6 are valid; keep them when data arrives from memory". That kind of logic is too big to replicate in each line of the cache array.
Also note that the pdf you found describes some specific behaviour of their AMD's K6 microarchitectures, which was single-core only, and some models only had a single level of cache, so no cache-coherency protocol was even necessary. They do describe the K6-III (model 9) using MESI between L1 and L2 caches.
A CPU writing to cache has to hold onto the data until the cache is ready to accept it. It's not a retry-until-success process, though. It's more like the cache notified the store hardware when it's ready to accept that store (i.e. it has that line active, and in the Modified state if the cache is coherent with other caches using the MESI protocol).
In a real CPU, multiple outstanding misses can be in flight at once (even without full out-of-order speculative execution). This is called miss under miss. The CPU<->cache connection needs a buffer for each outstanding miss that can be supported in parallel, to hold the store data. e.g. a core might have 8 buffers and support 8 outstanding load or store misses. A 9th memory operation couldn't start to happen until one of the 8 buffers became available. Until then, data would have to stay in the CPU's store queue.
These buffers might be shared between loads and stores, or there might be dedicated store buffers. The OP reports that searching on store buffer found lots of related stuff of interest; one example being this part of Wikipedia's MESI article.
The L1 cache is really a part of a CPU core in modern high-performance designs. It's very tightly integrated with the memory-order logic, and needs to be able to efficiently support atomic operations like lock inc [mem] and lots of other complications (like memory reordering). See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for example.
Some other terms:
store buffer
store queue
memory order buffer
cache write port / cache read port / cache port
globally visible
distantly related: An interesting post investigating the adaptive replacement policy of Intel IvyBridge's L3 cache, making it more resistant against evicting valuable data when scanning a huge array.
I am trying to complete a simulator based for a simplified mips computer using java. I believe I have completed the pipeline logic needed for my assignment but I am having a hard time understanding what the instruction and data caches are supposed to do.
The instruction cache should be direct-mapped with 4 blocks and the block size is 4 words.
So I am really confused on what the cache is doing. Is it going to memory and pulling the instruction from memory? For example, in one block it will have just the add command.
Would it make sense to implement it as a 2 dimensional array?
First you should know the basics of the cache. You can imagine cache as an intermediate memory which sits between the DRAM or main memory and your processor, however very much limited in size. Now, when you try to access a location in memory, you will search it first in the cache. If it is found (cache hit) the processor will take this data and resume the execution. Generally the cache hit is supposed to be very few clock cycles lets say 1 or 2. Suppose if the data is not found in the cache (cache miss), then the data is fetched from the main memory, filled in the cache and fed to the processor. The processor blocks till the data is fetched. This takes few hundreds of clock cycles normally depending on the DRAM you are using. The amount of data that is fetched from DRAM is equal to cacheline size. For that you should search for spatial locality of reference in caches.
I think this should get you a start.