Capacity miss occurs because blocks are being discarded from cache because cache cannot contain all blocks needed for program execution (program working set is much larger than cache capacity).
Conflict miss occurs in the case of set associative or direct mapped block placement strategies, conflict misses occur when several blocks are mapped to the same set or block frame; also called collision misses or interference misses.
Are they actually very closely related?
For example, if all the cache lines are filled and we have a read request for memory B, for which we have to evict memory A.
So should it be considered as a capacity miss since we don't have enough space? And later if we want to access memory A, and since it's evicted before, it's considered as a conflict miss.
Am I understanding this correctly? Thanks
The important distinction here is between cache misses caused by the size of your data set, and cache misses caused by the way your cache and data alignment are organized.
Lets assume you have a 32k direct mapped cache, and consider the following 2 cases:
You repeatedly iterate over a 128k array. There's no way the data can fit in that cache, therefore all the misses are capacity ones (except the first access of each line which is a compulsory miss, and would remain even if you could increase your cache infinitely).
You have 2 small 8k arrays, but unfortunately they are both aligned and map to the same sets. This means that while they could theoretically fit in the cache (if you fix your alignment), they will not utilize the full cache size and instead compete for the same group of sets and thrash each other. These are conflict misses, since the data could fit, but still collides due to organization. The same problem can occur with set associative caches, although less often (let's say the cache is 2-way, but you have 4 aligned data sets...).
The 2 types are indeed related, you could say that given high levels of associativity, set skewing, proper data alignments and other techniques, you could reduce the conflicts, until you're mostly left with true capacity misses that are unavoidable.
My favorite definition for conflict misses from Reducing Compulsory and Capacity misses by Norman P. Jouppi:
Conflict misses are misses that would not occur if the cache were
fully associative with LRU replacement.
Let's look at an example. We have a direct-mapped cache of size of 4. The access sequences are
0(compulsory miss), 1(compulsory miss), 2(compulsory miss), 3(compulsory miss), 4(compulsory
miss), 1(hit), 2(hit), 3(hit), 0(capacity miss), 4(capacity miss), 0(conflict miss)
The second to last 0 is a capacity miss because even if the cache were fully associative with LRU cache, it would still cause a miss because 4,1,2,3 are accessed before last 0. However the last 0 is a conflict miss because in a fully associative cache the last 4 would have replace 1 in the cache instead of 0.
Compulsory miss: when a block of main memory is trying to occupy fresh empty line of cache and the very first access to a memory Block that must be brought into cache is called compulsory miss.
Conflict miss: when still there are empty lines in the cache, block of main memory is conflicting with the already filled line of cache, ie., even when empty place is available, block is trying to occupy already filled line. its called conflict miss.
Capacity miss: miss occured when all lines of cache are filled.
conflict miss occurs only in direct mapped cache and set-associative cache. Because in associative mapping, no block of main memory tries to occupy already filled line.
Related
Could someone Please explain whether a capacity miss can happen in a direct mapped or set-associative caches? If that so, in what kind of instances?
capacity misses are mostly happen in fully associative caches because there's no mapping between memory and cache so that any item goes to a vacant slot in cache and finally the cache can be completely filled and may happen to replace incoming values with existing values.
There the cache misses (other than the compulsory miss) happen due to the capacity issue.
But in direct mapped caches, most of the cases what happens is, there's still cache slots available but the slot which the particular item is mapped is already contain another value. There a conflict is occurred.
This question already has answers here:
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
(3 answers)
Closed 3 years ago.
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536) -- explanation how cache associativity impacts performance. In a test author used 3 square arrays 1023x1023, 1024x1024, 1025x1025 of ints and observed that accessing first column was slower for 1024x1024 case.
Author explained (background info, CPU is Intel with L1 cache with 32KB memory, it is 8-way associative):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
It strikes me as odd, after reading wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can be potentially mapped into the same cache line, it is conflict after conflict, and CPU has to resolve those -- it takes time.
Thus my question, what is the real nature of performance problem here. Accessible memory size of cache is lower, or entire cache is available but CPU spends more time in resolving conflicts with mapping. Or there is some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column all words requested from RAM land in the same cell in cache (or the same 2 cells, if using a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of the row are in consecutive words of RAM; when one row is finished, the next row starts. With a little bit of thought, you can see that consecutive cells in a column have addresses differing by 1024*N, when N=4 or 8 (or whatever the size of a cell). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like now many bits are needed, hence the size of the cache, 2-way, etc, etc.)
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 bytes and positioned at some multiple of that in the reall address space. A "word" for caching depends on vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words probably can be found today. Again, the power-of-2 and positioned-at-multiple... are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
From a previous question on this forum, I learned that in most of the memory systems, L1 cache is a subset of the L2 cache means any entry removed from L2 is also removed from L1.
So now my question is how do I determine a corresponding entry in L1 cache for an entry in the L2 cache. The only information stored in the L2 entry is the tag information. Based on this tag information, if I re-create the addr it may span multiple lines in the L1 cache if the line-sizes of L1 and L2 cache are not same.
Does the architecture really bother about flushing both the lines or it just maintains L1 and L2 cache with the same line-size.
I understand that this is a policy decision but I want to know the commonly used technique.
Cache-Lines size is (typically) 64 bytes.
Moreover, take a look at this very interesting article about processors caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
In core i7 the line sizes in L1 , L2 and L3 are the same: that is 64 Bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
Typically, in one access to the main memory 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which) is accessed. And it is rather complicated to maintain different cache line sizes at the various memory levels. You have to note that cache line size would be more correlated to the word alignment size on that architecture than anything else. Based on that, a cache line size is highly unlikely to be different from memory access size. Now, the parity bits are for the use of the memory controller - so cache line size typically is 64 bytes. The processor really controls very little beyond the registers. Everything else going on in the computer is more about getting hardware in to optimize CPU performance. In that sense also, it really would not make any sense to import extra complexity by making cache line sizes different at different levels of memory.
Suppose I have two memory segments (equal size each, approximately 1kb in size) , one is read-only (after initialization), and other is read/write.
what is the best layout in memory for such segments in terms of memory performance? one allocation, contiguous segments or two allocations (in general not contiguous). my primary architecture is linux Intel 64-bit.
my feeling is former (cache friendlier) case is better.
is there circumstances, where second layout is preferred?
I would put the 2KB of data in the middle of a 4KB page, to avoid interference from reads and writes close to the page boundary. Similarly, keeping the write data separate is also good idea for the same reason.
Having contiguous read/write blocks may be less effiicent than keeping them separate. For example, a cache that is storing data for code interested in just the read-only portion may become invalidated by a write from another cpu. The cache line will be invalidated and refreshed, even though the code wasn't reading the writable data. By keeping the blocks separate, you avoid this case, and writes to the writable data block only invalidate cache lines for the writable block, and do not interfere with cache lines for the read only block.
Note that this is only a concern at the block boundary between the readable and writable blocks. If your block sizes were much larger than the cache line size, then this would be a peripheral problem, but as your blocks are small, requiring just a few cache lines, then the problem of invalidating lines could be significant.
With that small of data, it really shouldn't matter much. Both of those arrays will fit into any level cache just fine.
It'll depend on what you're doing with the memory. I'm fairly certain that contiguous (and page aligned!) would never be slower than two randomly placed segments, but it won't necessarily be any faster.
Given that it's an Intel processor, you probably only need to ensure that the addresses are not exactly a multiple of 64k apart. If they are, loads from either section that map to the same modulo 64k address will collide in L1 and cause an L1 miss. There's also a 4MB aliasing issue, but I'd be surprised if you ran into that.
Several of the resources I've gone to on the internet have disagree on how set associative caching works.
For example hardware secrets seem to believe it works like this:
Then the main RAM memory is divided in
the same number of blocks available in
the memory cache. Keeping the 512 KB
4-way set associative example, the
main RAM would be divided into 2,048
blocks, the same number of blocks
available inside the memory cache.
Each memory block is linked to a set
of lines inside the cache, just like
in the direct mapped cache.
http://www.hardwaresecrets.com/printpage/481/8
They seem to be saying that each cache block(4 cache lines) maps to a particular block of contiguous RAM. They are saying non-contiguous blocks of system memory(RAM) can't map to the same cache block.
This is there picture of how hardwaresecrets thinks it works
http://www.hardwaresecrets.com/fullimage.php?image=7864
Contrast that with wikipedia's picture of set associative cache
http://upload.wikimedia.org/wikipedia/commons/9/93/Cache%2Cassociative-fill-both.png.
Brown disagrees with hardware secrets
Consider what might happen if each
cache line had two sets of fields: two
valid bits, two dirty bits, two tag
fields, and two data fields. One set
of fields could cache data for one
area of main memory, and the other for
another area which happens to map to
the same cache line.
http://www.spsu.edu/cs/faculty/bbrown/web_lectures/cache/
That is, non-contiguous blocks of system memory can map to the same cache block.
How are the relationships between non-contiguous blocks on system memory and cache blocks created. I read somewhere that these relationships are based on cache strides, but I can't find any information on cache strides other than that they exist.
Who is right?
If striding is actually used, how does striding work and do I have the correct technical name? How do I find the stride for a particular system? is it based on the paging system? Can someone point me to a url that explains N-way set associative cache in great detail?
also see:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html
When I teach cache memory architecture to my students, I start with a direct-mapped cache. Once that is understood, you can think of N-way set associative caches as parallel blocks of direct-mapped cache. To understand that both figures may be correct, you need to first understand the purpose of set-assoc caches.
They are designed to work around the problem of 'aliasing' in a direct-mapped cache, where multiple memory locations can map to a specific cache entry. This is illustrated in the Wikipedia figure. So, instead of evicting a cache entry, we can use a N-way cache to store the other 'aliased' memory locations.
In effect, the hardware secrets diagram would be correct assuming the order of replacement is such that the first chunk of main memory is mapped to Way-1 and then the second chunk to Way-2 and so on so forth. However, it is equally possible to have the first chunk of main memory spread over multiple Ways.
Hope this explanation helps!
PS: Contiguous memory locations are only needed for a single cache line, exploiting spatial locality. As for the latter part of your question, I believe that you may be confusing several different concepts.
The replacement policy decides where in the cache a copy of a
particular entry of main memory will go. If the replacement policy is
free to choose any entry in the cache to hold the copy, the cache is
called fully associative. At the other extreme, if each entry in main
memory can go in just one place in the cache, the cache is direct
mapped. Many caches implement a compromise in which each entry in main
memory can go to any one of N places in the cache, and are described
as N-way set associative