Calculating CPU cache hits - caching

I am struggling to understand what the process/formula is for calculating cache hits. For example, if we have a main memory with 16 entries and a cache memory of 4 entries, and the CPU loads the memory addresses 0, 1, 2, 8, 9, 2, how can I calculate the number of hits if the cache is a) direct-mapped and b) 2-way set associative?

Assuming no prefetchers and LRU as the replacement policy.
a)
For a direct-mapped cache, each memory entry can be in only one cache entry.
The cache will map like this (assuming the default modulo mapping, which distributes memory entries uniformly across the cache entries):
Cache 0 --> can hold 0,4,8,12 of the main memory entries.
Cache 1 --> can hold 1,5,9,13 of the main memory entries.
Cache 2 --> can hold 2,6,10,14 of the main memory entries.
Cache 3 --> can hold 3,7,11,15 of the main memory entries.
After reset, the cache is empty.
Load from 0 will miss and will be cached in cache entry 0.
Load from 1 will miss and will be cached in cache entry 1.
Load from 2 will miss and will be cached in cache entry 2.
Load from 8 will miss and will be cached in cache entry 0 (replacing 0).
Load from 9 will miss and will be cached in cache entry 1 (replacing 1).
Load from 2 will hit and will be served from cache entry 2.
So we have 1 hit and 5 misses; the hit rate is 1/(5+1) = 1/6 ≈ 16.7%.
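The walk-through above can be reproduced with a few lines of Python. This is a minimal sketch, assuming one memory entry per cache entry and the modulo placement listed above; the function name `direct_mapped_hits` is only illustrative.

```python
def direct_mapped_hits(trace, num_entries):
    """Count hits for a direct-mapped cache holding one address per entry."""
    cache = [None] * num_entries      # cache[i] = address currently stored in entry i
    hits = 0
    for addr in trace:
        index = addr % num_entries    # the only entry this address may occupy
        if cache[index] == addr:
            hits += 1
        else:
            cache[index] = addr       # miss: overwrite whatever was there
    return hits

trace = [0, 1, 2, 8, 9, 2]
hits = direct_mapped_hits(trace, num_entries=4)
print(hits, len(trace) - hits)        # hits, misses
```

For the trace in the question this gives 1 hit and 5 misses.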
b) For a 2-way set-associative cache, each memory entry has 2 cache entries it can go to.
Set 0 (cache entries 0 and 1) will hold all the even main memory entries and set 1 (cache entries 2 and 3) will hold all the odd entries, so spelled out it looks like this:
cache 0 (set 0) --> can hold 0,2,4,6,8,10,12,14 of the main memory entries.
cache 1 (set 0) --> can hold 0,2,4,6,8,10,12,14 of the main memory entries.
cache 2 (set 1) --> can hold 1,3,5,7,9,11,13,15 of the main memory entries.
cache 3 (set 1) --> can hold 1,3,5,7,9,11,13,15 of the main memory entries.
After reset, the cache is empty.
Load from 0 will miss and will be cached in cache entry 0.
Load from 1 will miss and will be cached in cache entry 2.
Load from 2 will miss and will be cached in cache entry 1.
Load from 8 will miss and will be cached in cache entry 0 (replacing 0, because we replace the Least Recently Used entry of the set).
Load from 9 will miss and will be cached in cache entry 3.
Load from 2 will hit and will be served from cache entry 1.
In this case the hit rate is the same: 1/(5+1) = 1/6 ≈ 16.7%.
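The set-associative case can be checked the same way. The sketch below assumes one memory entry per cache entry, set selection by `address % number_of_sets`, and true LRU within each set; `set_associative_hits` is an illustrative name, and an `OrderedDict` is just one convenient way to track recency.

```python
from collections import OrderedDict

def set_associative_hits(trace, num_entries, ways):
    """Count hits for a set-associative cache with LRU replacement."""
    num_sets = num_entries // ways
    # One OrderedDict per set: oldest (least recently used) address comes first.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        lines = sets[addr % num_sets]
        if addr in lines:
            hits += 1
            lines.move_to_end(addr)          # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)    # evict the least recently used
            lines[addr] = True
    return hits

trace = [0, 1, 2, 8, 9, 2]
print(set_associative_hits(trace, num_entries=4, ways=2))
```

With `ways=1` the same function behaves like a direct-mapped cache, and with `ways=num_entries` it behaves like a fully associative one.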


Look Through vs Look aside

Suppose there are 2 caches, L1 and L2.
L1
Hit rate of L1 = 0.8
Access time of L1 = 2 ns
Transfer time between L1 and CPU = 10 ns
L2
Hit rate of L2 = 0.9
Access time of L2 = 5 ns
Transfer time between L2 and L1 = 100 ns
What will be the effective access time under the look-through and look-aside policies?
Look through and look aside are read policies of a cache architecture.
First, let's see the difference between them.
(1) LOOK THROUGH policy: If the processor wants some content, it first looks in the cache. On a cache hit it gets the content. On a cache miss it goes to the next level (here it searches L2 and then goes to main memory), reads the block from there, and copies the block into the cache for later accesses.
To calculate the access time here:
h = hit rate
c = cache access time
m = main memory access time
Access time = h * c + (1 - h) * (c + m)
for L1 = 2 + 10 = 12 ns
for L2 (through L1) = L1 time + 5 + 100 = 117 ns
for memory (through L1 + L2) = L1 time + L2 time + memory time = Mem ns (the memory time is not given in the question)
Access time = (0.8 * 12) + (0.2 * 0.9 * 117) + (0.2 * 0.1 * Mem) = (0.8 * 12) + (0.18 * 117) + (0.02 * Mem).
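The weighted sum above can be evaluated numerically. This is a sketch only: the function name and parameters are illustrative, and because the question gives no main-memory time, the example passes `t_mem=0` to compute just the two cache terms, leaving the `0.02 * Mem` term to be added once Mem is known.

```python
def look_through_access_time(h1, t1, h2, t2, t_mem):
    """Effective access time for a two-level look-through hierarchy.

    t1, t2 and t_mem are the *total* times seen by the CPU when the
    request is served by L1, L2 or main memory respectively.
    """
    p_l1 = h1                       # served by L1
    p_l2 = (1 - h1) * h2            # missed L1, hit L2
    p_mem = (1 - h1) * (1 - h2)     # missed both caches
    return p_l1 * t1 + p_l2 * t2 + p_mem * t_mem

t_l1 = 2 + 10            # L1 access + transfer to CPU = 12 ns
t_l2 = t_l1 + 5 + 100    # L1 time + L2 access + transfer to L1 = 117 ns

# Memory time is not given in the question (assumption: leave it out for now).
cache_terms = look_through_access_time(0.8, t_l1, 0.9, t_l2, t_mem=0)
print(round(cache_terms, 2), "ns + 0.02 * Mem")
```

The two cache terms add up to 9.6 + 21.06 = 30.66 ns.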
(2) LOOK ASIDE policy: The processor looks for the content in the cache and in main memory simultaneously.
Look aside requires more signalling for every access (both cache and main memory), and when the content is found in the cache, a cancel signal has to be sent to main memory. That is the biggest disadvantage of the look-aside policy.
To calculate the access time here,
you have to consider the signalling time for all of these operations.
Note: most caches use look through, because nowadays the cache hit ratio is typically more than 95%, so most of the time the content is available in the cache.
[For software/application caches] In both look-aside and look-through caches, the data is looked up first in the cache. In the look-aside case, it is the responsibility of the application to maintain the consistency of the data in the cache and insert the data back into the cache, whereas in the look-through case, the consistency is handled transparently by the cache, without the application being involved.
This means that for a look-aside cache, the application sends the request to the main memory, while in a look-through cache the request is forwarded from the cache itself.
See slides 14 and 15 in this slide deck for a visual illustration: https://www.cs.princeton.edu/courses/archive/fall19/cos316/lectures/08-caching.pdf

Computer Architecture: Cache Transfer Analysis

This is a question on my exam study guide and we have not yet covered how to calculate data transfer. Any help would be greatly appreciated.
Given is an 8-way set-associative level 2 data cache with a capacity of 2 MByte (1 MByte = 2^20 Byte)
and a block size of 128 Bytes. The cache is connected to the main memory by a shared 32 bit address and
data bus. The cache and the RISC-CPU are connected by separate address and data buses, each with a
width of 32 bit. The CPU is executing a load word instruction.
a) How much user data is transferred from the main memory to the cache in case of a cache miss?
b) How much user data is transferred from the cache to the CPU in case of a cache miss?
First, work out the cache geometry:
Number of cache blocks: 2MB / 128B = 16384 blocks (2^14)
Number of sets: 16384 / 8 ways = 2048 sets (11 index bits)
Address width: 32 bits
Line offset bits: log2(128) = 7 bits
Tag bits: 32 - 11 - 7 = 14 bits
So the cache line size is 128B. The problem already gives this as the block size (a line is a block), but it's good to know the above computation.
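These numbers are easy to double-check in code. The helper below is a minimal sketch that assumes all sizes are powers of two; `cache_geometry` is an illustrative name, not something from the exam question.

```python
def cache_geometry(capacity_bytes, block_bytes, ways, address_bits):
    """Return (blocks, sets, offset_bits, index_bits, tag_bits)."""
    num_blocks = capacity_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2 for a power of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return num_blocks, num_sets, offset_bits, index_bits, tag_bits

# 2 MByte, 128-byte blocks, 8-way, 32-bit addresses
print(cache_geometry(2 * 2**20, 128, 8, 32))
```

For the cache in the question this yields 16384 blocks, 2048 sets, 7 offset bits, 11 index bits and 14 tag bits.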
a) How much user data is transferred from the main memory to the cache
in case of a cache miss?
In your problem, the L2 cache is the last-level cache before main memory. So if you miss in the L2 cache (you don't find the line you are looking for), you need to fetch the whole line from main memory. So 128B of user data will be transferred from the main memory. The fact that the address bus and data bus are shared does not change this amount.
b) How much user data is transferred from the cache to the CPU in case
of a cache miss?
If you reached the L2 cache, that means you missed in the L1 cache. So a full L1 cache line has to be transferred from L2 to L1. If the L1 line size is also 128B (the problem does not say), then 128B of data will go from L2 to L1. The CPU will then use only a fraction of that line to feed the instruction that generated the miss in the L1 cache. Whether that line is then evicted from L2 or not depends on whether the cache is inclusive or exclusive, which should have been stated in the problem.

Difference between cache way and cache set

I am trying to learn some stuff about caches. Let's say I have a 4-way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4 cache lines.
How many cache ways do I have? I am not even sure what a cache way is. Can someone explain that? I have done some searching, the best example was
http://download.intel.com/design/intarch/papers/cache6.pdf
But I am still confused.
Thanks.
The cache you are referring to is known as a set-associative cache. The whole cache is divided into sets and each set contains 4 cache lines (hence a 4-way cache). So the relationship is:
cache size = number of sets in cache * number of cache lines in each set * cache line size
Your cache size is 32KB, it is 4 way and cache line size is 32B. So the number of sets is
(32KB / (4 * 32B)) = 256
If we think of the main memory as consisting of cache lines, then each memory region of one cache line size is called a block. So each block of main memory will be mapped to a cache line (but not always to a particular cache line, as it is set associative cache).
In set associative cache, each memory block will be mapped to a fixed set in the cache. But it can be stored in any of the cache lines of the set. In your example, each memory block can be stored in any of the 4 cache lines of a set.
Memory block to cache line mapping
Number of blocks in main memory = (1GB / 32B) = 2^25
Number of blocks in each page = (4KB / 32B) = 128
Each byte address in the system can be divided into 3 parts:
Rightmost bits represent byte offset within a cache line or block
Middle bits represent to which cache set this byte(or cache line) will be mapped
Leftmost bits represent tag value
Bits needed to represent 1GB of memory = 30 (1GB = (2^30)B)
Bits needed to represent offset in cache line = 5 (32B = (2^5)B)
Bits needed to represent 256 cache sets = 8 (2^8 = 256)
So that leaves us with (30 - 5 - 8) = 17 bits for tag. As different memory blocks can be mapped to same cache line, this tag value helps in differentiating among them.
When an address is generated by the processor, the 8 middle bits of the 30-bit address are used to select the cache set. There will be 4 cache lines in that set, so the tags of all four resident cache lines are checked against the tag of the generated address for a match.
Example
If a 30-bit address is 00000000000000000-00000100-00010 ('-' separated for clarity), then
offset within the cache line is 2
set number is 4
tag is 0
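The same split can be done with shifts and masks. This is a small sketch for the 30-bit address layout described above (17 tag bits, 8 set bits, 5 offset bits); `split_address` is an illustrative helper name.

```python
def split_address(addr, offset_bits, index_bits):
    """Split a byte address into (tag, set_index, offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset

# The example address, written as tag | set | offset
addr = int("00000000000000000" "00000100" "00010", 2)
print(split_address(addr, offset_bits=5, index_bits=8))
```

For the example address this returns tag 0, set 4, offset 2.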
In their "Computer Organization and Design, the Hardware-Software Interface", Patterson and Hennessy talk about caches. For example, in this version, page 408 shows the following image (I have added blue, red, and green lines):
Apparently, the authors use only the term "block" (and not "line") when they describe set-associative caches. In a direct-mapped cache, the "index" part of the address selects the line. In a set-associative cache, it selects the set.
This visualization should fit well with @Soumen's explanation in the accepted answer.
However, the book mainly describes Reduced Instruction Set Architectures (RISC). I am personally aware of MIPS and RISC-V versions. So, if you have an x86 in front of you, take this picture with a grain of salt, more as a concept visualization than as actual implementation.
If we divide the memory into cache-line-sized chunks (i.e. 32B chunks of memory), each of these chunks is called a block. Now when you try to access some memory address, the whole memory block (size 32B) containing that address will be placed into a cache line.
No, each set is not responsible for 4096KB or for one particular memory page. Multiple memory blocks from different memory pages can be mapped to the same cache set.

Difference between eviction due to clflush and eviction due to access to same set by other process

As per my understanding, when we use clflush(&Array1[i]), we manually evict the cache line where Array1[i] resides, and it is guaranteed that Array1[i] will no longer be present in the cache. The next time we access Array1[i] after the clflush, it has to be loaded from a higher level of the memory hierarchy, and so the access time is higher than before the clflush.
Is there a way to check whether the processor cache has been flushed recently?
took 81 ticks
took 81 ticks
flush: took 387 ticks // result is as expected
took 72 ticks
Now, suppose Array1[i] maps to cache set 'P' (W = 8 ways of associativity), and I assume that 8 elements of Array1 are loaded into all 8 cache lines of set 'P'. I have loaded them into the 8 blocks by accessing other elements of Array1.
Let another array element, Array2[j], map to ONLY one block of set 'P'.
Here are my questions.
Question 1:
If I access the Array2[j] element, then as per my understanding of the LRU cache replacement strategy, one element of Array1 will be evicted from one block of set 'P' to make room for the new element Array2[j], which also maps to the same set. The cache line/block to evict is selected according to the LRU policy.
Am I correct?
Question 2:
If the Array1 element is evicted from the cache to make room for the Array2[j] element, will it be stored in a victim cache (both Array1 and Array2 are in the L2 cache) and be loaded from that victim cache next time, or will the Array1 element be loaded from a higher cache, not from the victim cache?
Question 3:
If Array1 is evicted to make room for the Array2[j] element, and we then measure the access time for the Array1 element, will we always get a higher time than before accessing the Array2[j] element?
If I replace the clflush command with an access to Array2[j], which should have the same effect as clflush (removing the Array1 element from that cache line), then I get the following result for the cache line where the Array1 element is replaced by the access to Array2[j]:
took 81 ticks
took 81 ticks
ACCESS ARRAY2[j] element: took 75 ticks
took 72 ticks
I am confused by "ACCESS ARRAY2[j] element: took 75 ticks". Why is it lower? What is the correct answer? I am expecting a higher access time for the block that is evicted by the Array2 element.
I am using a Core i7 machine, and the correct use of rdtsc is described here:
clflush() in i3 or i7 processors
Can anyone explain? Thanks in advance.

How does direct mapped cache work?

I am taking a System Architecture course and I have trouble understanding how a direct mapped cache works.
I have looked in several places and they explain it in a different manner which gets me even more confused.
What I cannot understand is what is the Tag and Index, and how are they selected?
The explanation from my lecture is:
"Address is divided into two parts
index (e.g. 15 bits) used to address (32k) RAMs directly
Rest of address, tag, is stored and compared with incoming tag."
Where does that tag come from? It cannot be the full address of the memory location in RAM, since that would render the direct mapped cache useless (when compared with the fully associative cache).
Thank you very much.
Okay. So let's first understand how the CPU interacts with the cache.
There are three layers of memory (broadly speaking): cache (generally made of SRAM chips), main memory (generally made of DRAM chips), and storage (generally magnetic, like hard disks). Whenever the CPU needs data from some particular location, it first searches the cache to see if it is there. Cache memory lies closest to the CPU in the memory hierarchy, hence its access time is the lowest (and its cost is the highest). If the data the CPU is looking for can be found there, it constitutes a 'hit', and the data is obtained from there for use by the CPU. If it is not there, the data has to be moved from the main memory to the cache before it can be accessed by the CPU (the CPU generally interacts only with the cache), which incurs a time penalty.
So to find out whether the data is there or not in the cache, various algorithms are applied. One is this direct mapped cache method. For simplicity, let's assume a memory system where there are 10 cache memory locations available (numbered 0 to 9), and 40 main memory locations available (numbered 0 to 39). This picture sums it up:
There are 40 main memory locations available, but only up to 10 can be accommodated in the cache. So now, by some means, the incoming request from the CPU needs to be redirected to a cache location. That raises two problems:
How to redirect? Specifically, how to do it in a predictable way which will not change over time?
If the cache location is already filled up with some data, the incoming request from CPU has to identify whether the address from which it requires the data is same as the address whose data is stored in that location.
In our simple example, we can redirect by a simple logic. Given that we have to map 40 main memory locations numbered serially from 0 to 39 to 10 cache locations numbered 0 to 9, the cache location for a memory location n can be n%10. So 21 corresponds to 1, 37 corresponds to 7, etc. That becomes the index.
But 37, 17, and 7 all correspond to 7. To differentiate between them, we use the tag. Just as the index is n%10, the tag is int(n/10). So now 37, 17, and 7 will have the same index 7 but different tags: 3, 1, and 0 respectively. That is, the mapping can be completely specified by two values, the tag and the index.
So now if a request comes for address location 29, that will translate to a tag of 2 and an index of 9. The index corresponds to the cache location number, so cache location no. 9 will be queried to see if it contains any data and, if so, whether the associated tag is 2. If yes, it's a cache hit and the data will be fetched from that location immediately. If it is empty, or the tag is not 2, it means that it contains the data corresponding to some other memory address and not 29 (although it will have the same index, which means it contains data from an address like 9, 19, 39, etc.). So it is a cache miss, and the data from location no. 29 in main memory will have to be loaded into the cache at location 9 (changing the tag to 2 and overwriting any data that was there before), after which it will be fetched by the CPU.
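The lookup just described can be sketched in a few lines of Python. This toy model stores only the tag per cache location (no data), uses the decimal `n % 10` and `n // 10` split from the example, and the name `lookup` is purely illustrative.

```python
def lookup(cache, n, num_slots=10):
    """Access address n; return 'hit' or 'miss' and update the cache."""
    index = n % num_slots      # which cache location to look in
    tag = n // num_slots       # which of the competing addresses is stored there
    if cache.get(index) == tag:
        return "hit"
    cache[index] = tag         # miss: load the block and overwrite the old tag
    return "miss"

cache = {}                     # empty cache: no location holds a tag yet
results = [lookup(cache, n) for n in (29, 29, 9, 29)]
print(results)
```

The second access to 29 hits; the access to 9 then evicts it (same index 9, different tag), so the final access to 29 misses again.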
Let's use an example. A 64 kilobyte cache with 16 byte cache lines has 4096 different cache lines.
You need to break the address down into three different parts.
The lowest bits are used to tell you the byte within a cache line when you get it back, this part isn't directly used in the cache lookup. (bits 0-3 in this example)
The next bits are used to INDEX the cache. If you think of the cache as a big column of cache lines, the index bits tell you which row you need to look in for your data. (bits 4-15 in this example)
All the other bits are TAG bits. These bits are stored in the tag store for the data you have stored in the cache, and we compare the corresponding bits of the cache request to what we have stored to figure out whether the data we are caching is the data being requested.
The number of bits you use for the index is log_base_2(number_of_cache_lines) [it's really the number of sets, but in a direct mapped cache, there are the same number of lines and sets]
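To make the bit ranges concrete, here is a hedged sketch for the 64 KB, 16-byte-line example. It assumes 32-bit addresses for illustration (the answer does not state the address width), and `decode` is a made-up helper name. The two sample addresses are chosen to share index bits but differ in tag, so they would compete for the same line in a direct-mapped cache.

```python
LINE_BYTES = 16
CACHE_BYTES = 64 * 1024
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 4096 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4  -> address bits 0-3
INDEX_BITS = NUM_LINES.bit_length() - 1      # 12 -> address bits 4-15

def decode(addr):
    """Split an address into (tag, index, offset) for this cache."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

a = decode(0x12345678)
b = decode(0xABCD5678)
print([hex(x) for x in a])
print([hex(x) for x in b])
print(a[1] == b[1] and a[0] != b[0])         # same row, different tag: a conflict
```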
A direct mapped cache is like a table whose rows are called cache lines and which has at least 2 columns, one for the data and one for the tags.
Here is how it works: a read access to the cache takes the middle part of the address, called the index, and uses it as the row number. The data and the tag are looked up at the same time.
Next, the tag needs to be compared with the upper part of the address to decide whether the line is from the same address range in memory and is valid. At the same time, the lower part of the address can be used to select the requested data from the cache line (I assume a cache line can hold data for several words).
I emphasize that the data access and the tag access plus compare happen at the same time, because that is key to reducing the latency (the purpose of a cache). The data path RAM access doesn't need to be done in two steps.
The advantage is that a read is basically a simple table lookup and a compare.
But it is direct mapped that means for every read address there is exactly one place in the cache where this data could be cached. So the disadvantage is that a lot of other addresses would be mapped to the same place and may compete for this cache line.
I have found a good book at the library that has offered me the clear explanation I needed and I will now share it here in case some other student stumbles across this thread while searching about caches.
The book is "Computer Architecture - A Quantitative Approach" 3rd edition by Hennessy and Patterson, page 390.
First, keep in mind that the main memory is divided into blocks for the cache.
If the cache block size is 64 bytes and we have 1 GB of RAM, the RAM would be divided into 2^24 (about 16 million) blocks (1 GB of RAM / 64 B per block = 2^24 blocks).
From the book:
Where can a block be placed in a cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The destination block is calculated using this formula: <RAM Block Address> MOD <Number of Blocks in the Cache>
So, let's assume we have 32 blocks of RAM and 8 blocks of cache.
If we want to store block 12 from RAM to the cache, RAM block 12 would be stored into Cache block 4. Why? Because 12 / 8 = 1 remainder 4. The remainder is the destination block.
If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed anywhere in a restricted set of places in the cache, the cache is set associative.
Basically, a set is a group of blocks in the cache. A block is first mapped onto a set and then the block can be placed anywhere inside the set.
The formula is: <RAM Block Address> MOD <Number of Sets in the Cache>
So, let's assume we have 32 blocks of RAM and a cache divided into 4 sets (each set having two blocks, meaning 8 blocks in total). This way set 0 would have blocks 0 and 1, set 1 would have blocks 2 and 3, and so on...
If we want to store RAM block 12 into the cache, the RAM block would be stored in the Cache blocks 0 or 1. Why? Because 12 / 4 = 3 remainder 0. Therefore set 0 is selected and the block can be placed anywhere inside set 0 (meaning block 0 and 1).
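Both placement formulas are the same computation with a different number of ways, which a short sketch makes visible. It assumes, as in the example above, that set s owns cache blocks `s * ways` through `s * ways + ways - 1`; `candidate_blocks` is an illustrative name.

```python
def candidate_blocks(ram_block, num_cache_blocks, ways):
    """Return (set number, cache blocks where ram_block may be placed)."""
    num_sets = num_cache_blocks // ways
    set_number = ram_block % num_sets
    first = set_number * ways
    return set_number, list(range(first, first + ways))

print(candidate_blocks(12, 8, ways=1))   # direct mapped: 8 sets of 1 block
print(candidate_blocks(12, 8, ways=2))   # 2-way set associative: 4 sets of 2 blocks
print(candidate_blocks(12, 8, ways=8))   # fully associative: 1 set of 8 blocks
```

RAM block 12 lands in cache block 4 when direct mapped, in block 0 or 1 when 2-way set associative, and anywhere when fully associative.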
Now I'll go back to my original problem with the addresses.
How is a block found if it is in the cache?
Each block frame in the cache has an address. Just to make it clear, a block has both address and data.
The block address is divided into multiple pieces: Tag, Index and Offset.
The index selects the set in which the block is situated, the tag is then compared to find the block within that set, and the offset is used to select the data.
By "select the data" I mean that a cache block will obviously contain more than one memory location, and the offset is used to select between them.
So, if you want to imagine a table, these would be the columns:
TAG | INDEX | OFFSET | DATA 1 | DATA 2 | ... | DATA N
Tag would be used to find the block, index would show in which set the block is, offset would select one of the fields to its right.
I hope that my understanding of this is correct, if it is not please let me know.
