Look Through vs Look aside - caching

Suppose there are two caches, L1 and L2.
L1
Hit rate of L1 = 0.8
Access time of L1 = 2 ns
Transfer time between L1 and the CPU = 10 ns
L2
Hit rate of L2 = 0.9
Access time of L2 = 5 ns
Transfer time between L2 and L1 = 100 ns
What will be the effective access time under the look-through and look-aside policies?

Look through and look aside are read policies of a cache architecture.
First, we will see the difference between them.
(1) LOOK THROUGH policy = If the processor wants some content, it first looks in the cache. On a cache hit it gets the content from the cache. On a cache miss the request goes to the next level (here, an L1 miss goes to L2, and an L2 miss goes to main memory); the block is read from main memory and copied into the cache for further accesses.
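Here, to calculate the access time for a single cache in front of main memory:
h = hit rate
c = cache access time
m = main memory access time
Access time = h * c + (1 - h) * (c + m)
With two cache levels, first work out the time needed to get the data from each level:
for L1 = 2 + 10 = 12 ns
for L2 (through L1) = L1 time + 5 + 100 = 117 ns
for memory (through L1 + L2) = the time spent in L1 and L2 plus the main memory access; call this total Mem ns (the question does not give the main memory time)
Access time = (0.8 * 12) + (0.18 * 117) + (0.02 * Mem), where 0.18 = 0.2 * 0.9 and 0.02 = 0.2 * 0.1.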
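The look-through calculation above can be sketched in a few lines of Python. This is only an illustration of the weighted sum: the function name and the value chosen for Mem (317 ns) are made up for the example, since the question does not state the main memory time.

```python
def look_through_time(h1, h2, t_l1, t_l2, t_mem):
    """Expected access time when the levels are searched one after another.

    t_l1, t_l2 and t_mem are the cumulative times to get data from each level.
    """
    p_l1 = h1                      # request served by L1
    p_l2 = (1 - h1) * h2           # missed L1, hit L2
    p_mem = (1 - h1) * (1 - h2)    # missed both caches
    return p_l1 * t_l1 + p_l2 * t_l2 + p_mem * t_mem


t_l1 = 2 + 10            # L1 access + transfer to CPU
t_l2 = t_l1 + 5 + 100    # plus L2 access + transfer to L1
MEM_NS = 317             # hypothetical total for the memory path, not from the question

print(round(look_through_time(0.8, 0.9, t_l1, t_l2, MEM_NS), 2))
```

Swapping in the real memory time for MEM_NS gives the answer for a concrete system; the first two terms always contribute 30.66 ns with the numbers from the question.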
(2) LOOK ASIDE policy = The processor looks for the content in the cache and in main memory simultaneously.
Look aside requires more signalling for every access (both cache and main memory), and when the content is found in the cache, a cancel signal has to be sent to main memory. This is the biggest disadvantage of the look-aside policy.
Here, to calculate the access time
you have to consider the signalling time of all these operations.
Note - Most caches are look-through caches, because nowadays the cache hit ratio is more than 95%, so most of the time the content is available in the cache.

[For software/application caches] In both look-aside and look-through caches, the data is looked up first in the cache. In the look-aside case, it is the responsibility of the application to maintain the consistency of the data in the cache and to insert the data back into the cache, whereas in the look-through case, consistency is handled transparently by the cache, without the application being involved.
This means that for a look-aside cache, the application sends the request to the main memory, while in a look-through cache the request is forwarded from the cache itself.
See slides 14 and 15 in this slide deck for a visual illustration: https://www.cs.princeton.edu/courses/archive/fall19/cos316/lectures/08-caching.pdf
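The difference in who talks to the backing store can be sketched as follows. This is a minimal illustration, not taken from the slides: BackingStore, look_aside_get and LookThroughCache are hypothetical names, and a plain dict stands in for the cache.

```python
class BackingStore:
    """Stand-in for main memory or a database (hypothetical)."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


def look_aside_get(cache, store, key):
    """Look-aside: the application handles the miss itself."""
    if key in cache:
        return cache[key]
    value = store.read(key)   # application goes to the store directly
    cache[key] = value        # application inserts the value into the cache
    return value


class LookThroughCache:
    """Look-through: the cache forwards misses to the store on its own."""

    def __init__(self, store):
        self._store = store
        self._lines = {}

    def get(self, key):
        if key not in self._lines:
            self._lines[key] = self._store.read(key)
        return self._lines[key]


aside_store = BackingStore({"a": 1})
aside_cache = {}
look_aside_get(aside_cache, aside_store, "a")
look_aside_get(aside_cache, aside_store, "a")   # second call is a cache hit

through_store = BackingStore({"b": 2})
through_cache = LookThroughCache(through_store)
through_cache.get("b")
through_cache.get("b")                          # second call is a cache hit
```

In the look-aside version the caller needs a handle on both the cache and the store; in the look-through version it only ever sees the cache.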

Related

Finding average memory access time, AMAT and global miss rate

I'm quite confused about this question. I have IL1, DL1 and UL2, and when I try to find AMAT, do I use the formula AMAT = Hit Time(1) + Miss Rate * (Hit Time(2) + Miss Rate * Miss Penalty)? Or do I also add Hit Time(3) because there are 3 miss rates?
For Example: 0.4 + 0.1 * (0.8 + 0.05 * (10 + 0.02 * 48))
I used AMAT = Hit Time(1) + Miss Rate * (Hit time(2) + Miss Rate * (Hit time(3) + Miss Rate * Miss Penalty))
Here is the table. The frequency is 2.5 GHz, and it is also given that 20% of all instructions are of load/store type.
By the way, is there also a way to find the global miss rate of UL2 in %? I'm quite stuck on that one too.
There are two different cache hierarchies to consider. I cannot tell from your question whether you're trying to compute AMAT for just data operations (load & store) or for instruction access plus data operations (20% of them).
The hierarchies:
Instruction Cache: IL1 backed by UL2 backed by Main Memory
Data Cache: DL1 backed by UL2 backed by Main Memory
There is a stated hit time & miss rate associated with each individual cache, and, this is necessary because the caches are of different construction and size (and also in different positions in the hierarchy).
All instructions participate in accessing of the Instruction Cache, so hit/miss there applies to every instruction regardless of the nature or type of the instruction.  So, you can compute the AMAT for instruction access alone generally using the IL1->UL2->Main Memory hierarchy — be sure to use the specific hit time and miss rate for each given level in the hierarchy: 1clk & 10% for IL1; 25clk & 2% for UL2; and 120clk & 0% for Main Memory.
20% of the instructions participate in accessing of the Data Cache.
Of those that do data accesses, you can compute that component of AMAT using the DL1->UL2->Main Memory hierarchy — here you have DL1 with 2clk & 5%; UL2 with 25clk & 2%; and Main Memory with 120clk & 0%.
These numbers can be combined into an overall value that accounts for 100% of the instructions incurring the instruction cache hierarchy AMAT, and 20% of them incurring the data cache hierarchy AMAT.
As needed you can convert AMAT in cycles/clocks to AMAT in (nano) seconds.
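As a rough sketch of the combination described above, the two hierarchies can be evaluated separately and then weighted. The code assumes the stated miss rates are local (per-cache) miss rates and that a UL2 miss adds the full 120-clock main memory time; the function and variable names are made up for the example.

```python
def amat_cycles(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    """AMAT in clock cycles for an L1 backed by an L2 backed by main memory."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)


UL2_HIT, UL2_MISS, MEM = 25, 0.02, 120   # shared by both hierarchies

amat_instr = amat_cycles(1, 0.10, UL2_HIT, UL2_MISS, MEM)   # IL1 -> UL2 -> memory
amat_data = amat_cycles(2, 0.05, UL2_HIT, UL2_MISS, MEM)    # DL1 -> UL2 -> memory

LOAD_STORE_FRACTION = 0.20   # 20% of instructions also access data
mem_cycles_per_instr = amat_instr + LOAD_STORE_FRACTION * amat_data

CYCLE_NS = 1 / 2.5           # 2.5 GHz clock, so 0.4 ns per cycle
print(round(amat_instr, 3), round(amat_data, 3))
print(round(mem_cycles_per_instr, 3), round(mem_cycles_per_instr * CYCLE_NS, 4))
```

Note that the instruction and data paths never chain through each other: IL1 and DL1 each fall back to UL2, which is why the three-level nested formula from the question does not apply.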

how do you find the miss penalty of a single level cache?

I am attempting to find the average memory access time (AMAT) of a single level cache. In order to do so, miss penalty must be calculated since the AMAT formula requires it.
Doing this for a multilevel cache requires using the next level cache penalty. But for a single level, there is obviously no other cache level.
So how is this calculated?
formula:
AMAT = HIT-TIME + MISS-RATE * MISS-PENALTY
You have the correct formula to calculate the AMAT, however you may be misinterpreting the components of the formula. Let’s take a look at how to use this equation, first with a single-level cache and next with a multi-level cache.
Suppose you have a single-level cache. Hit time represents the amount of time required to search and retrieve data from the cache. Miss rate denotes the percentage of the data requested that does not reside in the cache i.e. the percentage of data you have to go to main memory to retrieve. Miss penalty is the amount of time required to retrieve the data once you miss in the cache. Because we are dealing with a single-level cache, the only other level in the memory hierarchy to consider is main memory for the miss penalty.
Here’s a good example for single-level cache:
L1 cache has an access time of 5ns and a miss rate of 50%
Main memory has an access time of 500ns
AMAT = 5ns + 0.5 * 500ns = 255ns
You always check the cache first, so you always incur a 5 ns hit time overhead. Because our miss rate is 0.5, we find what we are looking for in the L1 cache half the time and must go to main memory the remaining half of the time. The miss penalty itself is the 500 ns main memory access; its average contribution per access is the following weighted average:
(0.5 * 0ns) + (0.5 * 500ns) = (0.5 * 500ns) = 250ns.
Now, suppose you have a multi-level cache i.e. L1 and L2 cache. Hit time now represents the amount of time to retrieve data in the L1 cache. Miss rate is an indication of how often we miss in the L1 cache. Calculating the miss penalty in a multi-level cache is not as straightforward as before because we need to consider the time required to read data from the L2 cache as well as how often we miss in the L2 cache.
Here’s a good example:
L1 cache has an access time of 5 ns and miss rate of 50%
L2 cache has an access time of 50 ns and miss rate of 20%
Main memory has an access time of 500 ns
AMAT = 5ns + 0.5 * (50ns + 0.2 * 500ns) = 80 ns
Again, you always check the L1 cache first, so you always incur a 5 ns hit time overhead. Because our miss rate is 0.5, we find what we are looking for in the L1 cache half the time and must go down the memory hierarchy (L2 cache, main memory) the remaining half of the time. If we do not find the data in the L1 cache, we always look in the L2 cache next. We thus incur a 50 ns hit time overhead every time we miss in the L1 cache. In the case that the data is not in the L2 cache either (which is 20% of the time), we must go to main memory, which has a memory access time of 500 ns.
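Both examples follow the same pattern: the miss penalty of one level is simply the AMAT of the level below it. A small sketch of that idea, with a made-up helper name and the numbers from the two examples above:

```python
def amat(levels, memory_time):
    """Average memory access time.

    levels: list of (hit_time, local_miss_rate) pairs, from L1 outward.
    memory_time: access time of main memory, which is assumed to always hit.
    """
    penalty = memory_time
    # Work from the last cache level back up to L1: each level's miss
    # penalty is the average access time of everything below it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


single_level = amat([(5, 0.5)], 500)
two_level = amat([(5, 0.5), (50, 0.2)], 500)
print(single_level, two_level)
```

For a single-level cache the loop runs once and the miss penalty is just the main memory time, which answers the original question.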

CPU cache hit time

I am reading "Computer Organization and Design: The Hardware/Software Interface" in the Spanish edition, and I have run into an exercise that I cannot solve. The exercise is about the memory hierarchy, specifically caches.
The exercise says:
Suppose 2.5 ns are required to access the labels in an N-way associative cache, 4 ns to access the data, 1 ns for the hit/miss comparison, and 1 ns to return the selected data to the processor in case of a hit.
Is the critical path in a cache hit given by the time to determine whether there has been a hit, or by the data access time?
What is the hit latency of the cache? (successful case).
What would the hit latency of the cache be if the access times to both the labels and the data array were 3 ns?
I'll try to answer the questions with all I know about memories.
To access data stored in the cache, the first thing I have to do is find the line using the index field of the address. Once the memory system has found the line, I need to compare the label field of my address with the label field of the cache line. If they match, it is a hit: I select the data within the line using the offset field of the address and then return it to the processor.
That implies that the cache will take 8.5 ns. But I have been thinking of another way caches could do it: once I get the desired line (2.5 ns), I can access the data and, in parallel, evaluate the equality condition. So the time would be 4.5 ns. One of these must be the answer to the second question. Which of these results is correct?
For the first question, the critical path will be the operation that takes the largest amount of time; if the cache takes 4.5 ns to get the data, then the critical path will be accessing the labels in the cache, the comparison, and returning the data. Otherwise, it will be the entire process.
For the last question, if the critical path is the entire process, then it will take 8 ns. Otherwise, it will take 5 ns (label access in the cache, comparison, returning the data).
Is this true? And what about a fully associative cache, and a direct-mapped cache?
The problem is that I do not know what the cache does first, what it does next, and what it does in parallel.
If the text doesn't say anything about whether it's a cache in a uniprocessor or multiprocessor system, or about what the cache does in parallel, you can safely assume that it performs the whole process sequentially in case of a cache hit. Intuitively, I think it doesn't make sense to access the data and compare hit/miss in parallel: what if it's a miss? Then the data access is unnecessary and you increase the latency of the cache miss.
So then you get the following sequence in case of a cache hit:
Access label (2.5ns)
Compare hit/miss (1ns)
Access the data (4ns)
Return the data to the program requesting it (1ns)
Total: 2.5 + 1 + 4 +1 = 8.5ns
With this sequence we get (as you already know) the following answers to the questions:
Answer: The critical path in a cache hit is accessing the data and returning it, 4 + 1 = 5 (ns), compared to determining whether the cache lookup was a success, 2.5 + 1 = 3.5 (ns).
Answer: 8.5ns
Answer: 3 + 1 + 3 + 1 = 8ns
If I get the desired line (2.5 ns) then now I can access the data, and
in parallel, I can evaluate the condition of equality. So, the time
will be 4.5 ns
I don't see how you get 4.5ns. If you assume that the data access and the hit/miss comparison are executed in parallel, then you get 2.5 + 4 + 1 = 7.5ns in case of a cache hit. You would also get 7.5ns in case of a cache miss, whereas if you don't access the data until you know whether it's a hit, you get a miss latency of 2.5 + 1 = 3.5ns instead, which makes it very ineffective to try to parallelize the hit/miss comparison with the data access.
If you assume that the access of the label and the hit/miss comparison is done in parallel with data access you get: 4 + 1 = 5ns in case of a cache hit.
Obviously you cannot return the data in parallel with fetching the data, but if you imagine that would be possible and you access the label and do comparison and return the data in parallel with accessing the data then you get: 2.5 + 1 + 1 = 4.5ns.
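The sequential case and the case where the label access plus comparison overlaps with the data access can be compared with a small critical-path calculation. This is only a sketch of those two assumptions; the function name and the parallel_data flag are invented for the example, and the exercise itself does not say which model the cache uses.

```python
def hit_latency(tag_ns, compare_ns, data_ns, return_ns, parallel_data=False):
    """Hit latency of a cache under two simple timing assumptions."""
    tag_path = tag_ns + compare_ns          # label (tag) access, then comparison
    if parallel_data:
        # The data array is read while the tag is accessed and compared,
        # so the slower of the two paths is the critical path.
        return max(tag_path, data_ns) + return_ns
    # Fully sequential: tag access, comparison, data access, return.
    return tag_path + data_ns + return_ns


sequential = hit_latency(2.5, 1, 4, 1)
overlapped = hit_latency(2.5, 1, 4, 1, parallel_data=True)
sequential_3ns = hit_latency(3, 1, 3, 1)
overlapped_3ns = hit_latency(3, 1, 3, 1, parallel_data=True)
print(sequential, overlapped, sequential_3ns, overlapped_3ns)
```

With the original numbers the data access (4 ns) is the longer path in the overlapped case; with 3 ns for both accesses the label path (3 + 1 ns) becomes critical instead.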
what about a fully associative cache?, and a direct-mapped cache?
An N-way associative cache (as the question refers to) restricts each block to the N ways of one set, chosen by the index; a fully associative cache is the special case with a single set, so cache blocks can be placed anywhere in the cache. Hence it's very flexible, but it also means that when we want to look up a memory address in the cache we need to compare the tag with every block in the cache (or, in the N-way case, with every block in the set) to know whether the memory address we're looking for is cached or not. Consequently we get a slower lookup time.
In a direct mapped cache, every cache block can only go in one spot in the cache. That spot is computed by looking at the memory address and computing the index-part of the address. Thus a direct-mapped cache can give very quick lookups but is not very flexible. Depending on the cache size, cache blocks can be replaced very often.
The terminology in the question is a bit confusing: "label" is usually called "tag" when speaking of CPU caches.

Calculating actual/effective CPI for 3 level cache

(a) You are given a memory system that has two levels of cache (L1 and L2). Following are the specifications:
Hit time of L1 cache: 2 clock cycles
Hit rate of L1 cache: 92%
Miss penalty to L2 cache (hit time of L2): 8 clock cycles
Hit rate of L2 cache: 86%
Miss penalty to main memory: 37 clock cycles
Assume for the moment that hit rate of main memory is 100%.
Given a 2000 instruction program with 37% data transfer instructions (loads/stores), calculate the CPI (Clock Cycles per Instruction) for this scenario.
For this part, I calculated it like this (am I doing this right?):
(m1: miss rate of L1, m2: miss rate of L2)
AMAT = HitTime_L1 + m1*(HitTime_L2 + m2*MissPenalty_L2)
CPI(actual) = CPI(ideal) + (AMAT - CPI(ideal))*AverageMemoryAccess
(b) Now let's add another level of cache, i.e., an L3 cache between the L2 cache and the main memory. Consider the following:
Miss penalty to L3 cache (hit time of L3 cache): 13 clock cycles
Hit rate of L3 cache: 81%
Miss penalty to main memory: 37 clock cycles
Other specifications remain as part (a)
For the same 2000 instruction program (which has 37% data transfer instructions), calculate the CPI.
(m1: miss rate of L1, m2: miss rate of L2, m3: miss rate of L3)
AMAT = HitTime_L1
+ m1*(HitTime_L2 + m2*MissPenalty_L2)
+ m2*(HitTime_L3 + m3*MissPenalty_L3)
Is this formula correct and where do I add the miss penalty to main memory in this formula?
It should probably be added with the miss penalty of L3 but I am not sure.
(a) The AMAT calculation is correct if you notice that the MissPenalty_L2 parameter is what you called Miss penalty to main memory.
The CPI is a bit more difficult.
First of all, let's assume that the CPU is not pipelined (sequential processor).
There are 1.37 memory accesses per instruction (one access to fetch the instruction and 0.37 due to data transfer instructions). The ideal case is that all memory accesses hit in the L1 cache.
So, knowing that:
CPI(ideal) = CPI(computation) + CPI(mem) =
CPI(computation) + Memory_Accesses_per_Instruction*HitTime_L1 =
CPI(computation) + 1.37*HitTime_L1
With real memory, the average memory access time is AMAT, so:
CPI(actual) = CPI(computation) + Memory_Accesses_per_Instruction*AMAT =
CPI(ideal) + Memory_Accesses_per_Instruction*(AMAT - HitTime_L1) =
CPI(ideal) + 1.37*(AMAT - HitTime_L1)
(b) Your AMAT calculation is wrong. After a miss at L2, an L3 access follows, which can itself be a hit or a miss. Try to finish the exercise yourself.
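For checking your own result, here is one possible reading of the formulas above in Python. It nests the L3 term inside the L2 miss term, as the hint for (b) suggests, and it only computes the memory-related part of the CPI because CPI(computation) is not given in the exercise. All names are made up for the sketch.

```python
L1_HIT, L2_HIT, L3_HIT, MEM = 2, 8, 13, 37     # clock cycles, from the exercise
m1, m2, m3 = 1 - 0.92, 1 - 0.86, 1 - 0.81      # local miss rates
ACCESSES_PER_INSTR = 1 + 0.37                  # instruction fetch + loads/stores

# (a) two levels: an L2 miss goes straight to main memory
amat_a = L1_HIT + m1 * (L2_HIT + m2 * MEM)

# (b) three levels: an L2 miss goes to L3, and only an L3 miss reaches memory
amat_b = L1_HIT + m1 * (L2_HIT + m2 * (L3_HIT + m3 * MEM))


def extra_cpi(amat_value):
    """CPI(actual) - CPI(ideal), i.e. cycles lost to misses per instruction."""
    return ACCESSES_PER_INSTR * (amat_value - L1_HIT)


print(round(amat_a, 4), round(extra_cpi(amat_a), 4))
print(round(amat_b, 4), round(extra_cpi(amat_b), 4))
```

Adding CPI(ideal) to either extra_cpi value gives CPI(actual) under the non-pipelined assumption stated above.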

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked
machines? ("To process" refers to finding the single UTF-16 character
with the most occurrences in that 1TB file). The rate of data
transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and
each computer has a quad core CPU.
Below is my attempt at the question using an array of longs (with array size of 2^16) to keep track of the character count. This should fit into memory of a single machine, since 2^16 x 2^3 (size of long) = 2^19 = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. The final answer is:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^25/2^31s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while receiving)
f) Total time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains 550 Gsyms, and because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is 8 bytes/element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes or just 512 KiB and not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also, this array size fits perfectly in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequency and you have also overlooked the fact that it can overlap disk reads. In fact it is so fast to update the frequency array, which resides in the CPU cache, that the computation time is negligible as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive and hence you would still need the full 5.8 hrs to read the data in the single machine case.
In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
You only need to send 0.8 TB if your source machine is one of the 5.
It may not even make sense sending the data to other machines. Consider this:
In order for the source machine to send the data, it must first hit the disk to read the data into main memory before it sends the data over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So under the assumption that loading to CPU cache is much less expensive than disk to memory or data over network (which is true, unless you are dealing with alien hardware), then you are better off just doing it on the source machine, and the only place splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1 TB file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern for the file data is optimal since it is sequential, so you only cache miss once per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.
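The back-of-envelope numbers from the question can be reproduced with a short script. It uses the same assumptions as the question (20 ms to read 1 MB sequentially from disk, a 1 Gbit/s link treated as 1 s per Gbit, binary units, no overlap between reading and sending), so it illustrates that model rather than measuring anything; the function names are made up.

```python
MB_PER_TB = 1024 * 1024
GB_PER_TB = 1024
DISK_MS_PER_MB = 20        # sequential disk read, figure used in the question
LINK_GBIT_PER_S = 1


def disk_read_seconds(fraction=1.0):
    """Time to read the given fraction of the 1 TB file from a single disk."""
    return fraction * MB_PER_TB * DISK_MS_PER_MB / 1000


def network_seconds(fraction=1.0):
    """Time to push the given fraction of the 1 TB file over the link."""
    gbits = fraction * GB_PER_TB * 8
    return gbits / LINK_GBIT_PER_S


single_machine = disk_read_seconds()

# Model from the question: read and send 4/5 of the file, then read the
# remaining 1/5 locally, with none of these steps overlapping.
five_machines = (disk_read_seconds(0.8) + network_seconds(0.8)
                 + disk_read_seconds(0.2))

print(round(single_machine / 3600, 2), round(five_machines / 3600, 2))
```

Either way the whole file passes through the one disk, so the disk term is identical in both totals and the network transfer can only add to it.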

Resources