I am reading the Spanish edition of "Computer Organization and Design: The Hardware/Software Interface", and I have run into an exercise that I cannot solve. The exercise is about the memory hierarchy, specifically caches.
The exercise says:
If 2.5 ns are required to access the labels in an N-way associative cache, 4 ns to access the data, 1 ns for the hit/miss comparison, and 1 ns to return the selected data to the processor in the case of a hit:
Is the critical path in a cache hit given by the time to determine whether there has been a hit, or by the data access time?
What is the hit latency of the cache (the hit case)?
What would the cache hit latency be if the access times to both the labels and the data array were 3 ns?
I'll try to answer the questions with all I know about memories.
To access a piece of data stored in the cache, the first thing I have to do is find the line using the index field of the address. Once the memory system has found the line, I need to compare the label field of my address with the label field stored in the cache. If they match, it is a hit: I select the word within the line using the offset field of the address and then return that data to the processor.
That implies that the cache will take 8.5 ns. But I have been thinking of another way that caches could do it: once I get the desired line (2.5 ns), I can access the data and, in parallel, evaluate the equality condition. So the time would be 4.5 ns. One of these must be the answer to the second question. Which of these results is correct?
For the first question, the critical path will be whichever sequence of operations takes longer; if the cache takes 4.5 ns to get the data, then the critical path will be label access - comparison - return of the data. Otherwise, it will be the entire process.
For the last question, if the critical path is the entire process, then it will take 8 ns. Otherwise, it will take 5 ns (label access, comparison, return of the data).
Is this correct? And what about a fully associative cache, or a direct-mapped cache?
The problem is that I do not know which things the cache does first, which it does next, and which it does in parallel.
If the text doesn't say anything about whether it's a cache in a uniprocessor or multiprocessor system, or about what the cache does in parallel, you can safely assume that it performs the whole process sequentially in the case of a cache hit. Instinctively, I think it doesn't make sense to access the data and do the hit/miss comparison in parallel: what if it's a miss? Then the data access is unnecessary and you increase the latency of a cache miss.
So then you get the following sequence in case of a cache hit:
Access the label (2.5 ns)
Compare hit/miss (1 ns)
Access the data (4 ns)
Return the data to the program requesting it (1 ns)
Total: 2.5 + 1 + 4 + 1 = 8.5 ns
With this sequence we get (as you already know) the following answers to the questions:
Answer: The critical path in a cache hit is accessing the data and returning it, 4 + 1 = 5 ns, compared to determining whether the cache lookup was a hit: 2.5 + 1 = 3.5 ns.
Answer: 8.5 ns
Answer: 3 + 1 + 3 + 1 = 8 ns
Once I get the desired line (2.5 ns), I can access the data and, in parallel, evaluate the equality condition. So the time would be 4.5 ns.
I don't see how you get 4.5 ns. If you assume that the data access and the hit/miss comparison are executed in parallel (after the label access), then you get 2.5 + 4 + 1 = 7.5 ns in the case of a cache hit. You would also have committed to that data access in the case of a cache miss, whereas if you don't access the data until you know it's a hit, you detect a miss after only 2.5 + 1 = 3.5 ns, which makes it rather ineffective to try to parallelize the hit/miss comparison with the data access.
If you assume that the label access and the hit/miss comparison are done in parallel with the data access, you get max(2.5 + 1, 4) + 1 = 4 + 1 = 5 ns in the case of a cache hit.
Obviously you cannot return the data in parallel with fetching it, but if you imagine that were possible, and you access the label, do the comparison, and return the data all in parallel with the data access, then you get 2.5 + 1 + 1 = 4.5 ns.
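To make the arithmetic concrete, here is a minimal C sketch (just the stage times from the exercise, nothing more) that computes the fully serial hit latency and the hit latency with the label access and comparison overlapped with the data access:

    #include <stdio.h>

    /* Stage times from the exercise, in nanoseconds. */
    #define T_LABEL    2.5  /* label (tag) access         */
    #define T_DATA     4.0  /* data array access          */
    #define T_COMPARE  1.0  /* hit/miss comparison        */
    #define T_RETURN   1.0  /* return the data to the CPU */

    static double max2(double a, double b) { return a > b ? a : b; }

    int main(void) {
        /* Fully serial: label -> compare -> data -> return. */
        double serial = T_LABEL + T_COMPARE + T_DATA + T_RETURN;

        /* Label access + comparison overlapped with the data access,
           then return the selected word. */
        double overlapped = max2(T_LABEL + T_COMPARE, T_DATA) + T_RETURN;

        printf("serial hit latency:     %.1f ns\n", serial);     /* 8.5 ns */
        printf("overlapped hit latency: %.1f ns\n", overlapped); /* 5.0 ns */
        return 0;
    }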
what about a fully associative cache, or a direct-mapped cache?
An N-way set-associative cache (which is what the question refers to) allows a cache block to be placed in any of the N ways of its set; a fully associative cache is the extreme case where a block can be placed anywhere in the cache. That flexibility means that when we want to look up a memory address in the cache, we need to compare the tag against every block the address could be in (every way of the set, or every block in the fully associative case) to know whether the memory address we're looking for is cached or not. Consequently, we get a slower lookup.
In a direct mapped cache, every cache block can only go in one spot in the cache. That spot is computed by looking at the memory address and computing the index-part of the address. Thus a direct-mapped cache can give very quick lookups but is not very flexible. Depending on the cache size, cache blocks can be replaced very often.
The terminology in the question is a bit confusing: "label" is usually called "tag" when speaking of CPU caches.
I am reading Andrew Tanenbaum's Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries.In fact, up to 64 KB of contiguous data can be stored in the cache.However,two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value).For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or anyother location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there.If this happens often enough, itcan result in poor behavior.In fact, the worst-case behavior of a cache is worsethan if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. The first is associativity: for every possible input cache-line address (64-byte aligned on a modern x86-64 system) there are only N possible slots in the cache where it may be stored.
The second is a problem much like the one encountered with the hash function used in a hashmap. Simply put, some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 KB; 64 KB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So in this case the address -> cache slot function is a simple AND operation, and it appears the author is talking about a 1-way associative (direct-mapped) cache, that is, each line may only be stored in ONE location inside the cache, leading to the mentioned conflict.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.
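As a hedged sketch of that address-to-slot function (the 32-byte line size is my assumption, chosen so the book's 64 KB cache gives 2048 lines; what matters is that the cache size and the conflict distance are both 65,536 bytes), the index computation is just a shift and an AND:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: a 64 KB direct-mapped cache with 32-byte lines. */
    #define CACHE_SIZE 65536u
    #define LINE_SIZE  32u
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 2048 lines */

    /* Drop the offset bits, then keep log2(NUM_LINES) index bits. */
    static unsigned line_index(uint32_t addr) {
        return (addr / LINE_SIZE) & (NUM_LINES - 1);
    }

    int main(void) {
        uint32_t x = 0x12345;            /* some address X              */
        uint32_t y = x + 65536;          /* X + 64 KB                   */
        printf("index(X)         = %u\n", line_index(x));
        printf("index(X + 65536) = %u\n", line_index(y));  /* same slot */
        return 0;
    }

Any two addresses that differ by a multiple of 65,536 land on the same index, which is exactly the conflict the book describes.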
I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 requests per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of a C program which forces "N" L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level.[1]
You would expect the probability of any given load being a miss to be something like p(hit) = min(1, C / W) and p(miss) = 1 - p(hit), where p(hit) and p(miss) are the probabilities of a hit and a miss, C is the relevant cache size, and W is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss) will never be 100%, since C/W only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory access to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10 accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depend on what you are studying.
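Since the question asks for C, here is a hedged sketch of that outline (the 2 MiB working-set size and the iteration count are placeholders, not tuned values): it builds a random single-cycle permutation over an array and then chases it, so every load depends on the previous one and no slot is revisited until the whole set has been walked. Verify the actual miss counts with the performance counters mentioned above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Working-set size in bytes: pick this relative to your L2 size
       (several times larger to drive the miss rate up). */
    #define WORKING_SET (2u * 1024 * 1024)
    #define NUM_SLOTS   (WORKING_SET / sizeof(size_t))

    int main(void) {
        size_t *slots = malloc(WORKING_SET);
        if (!slots) return 1;

        /* Sattolo's algorithm: shuffle the identity mapping into one big
           cycle, so following slots[i] visits every slot exactly once. */
        for (size_t i = 0; i < NUM_SLOTS; i++) slots[i] = i;
        for (size_t i = NUM_SLOTS - 1; i > 0; i--) {
            size_t r = ((size_t)rand() << 15) ^ (size_t)rand();
            size_t j = r % i;            /* j in [0, i-1], never i itself */
            size_t tmp = slots[i]; slots[i] = slots[j]; slots[j] = tmp;
        }

        /* Dependent (pointer-chasing) loads: each load's result is the
           index of the next load, so accesses cannot overlap. */
        size_t idx = 0, sink = 0;
        for (size_t n = 0; n < 10u * NUM_SLOTS; n++) {
            idx = slots[idx];
            sink += idx;
        }

        printf("%zu\n", sink);  /* keep the loop from being optimized away */
        free(slots);
        return 0;
    }

For independent accesses (more outstanding misses per unit time rather than a latency measurement), index the array with a separately computed pseudo-random sequence instead of the loaded value.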
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
[1] This gets a bit trickier for levels of cache that are shared among cores (usually L3 for recent Intel chips, for example), but it should work well if your machine is pretty quiet while testing.
A quick question to make sure I understand the concept behind a "block" and its usage with caches.
Say I have a small cache that holds 4 blocks of 4 words each, and let's say it's also direct mapped. If I try to access the word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache, or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is triggered, does the CPU load one block's worth of data (4 words) starting at the accessed address, or does it calculate which block that word in memory belongs to and bring in that block instead?
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, i.e., accessed by using some portion of the requested address (i.e., a "lookup table key", if you will). If the cache used a block size of 1 word, the entire address -- all N bits of it -- would be the "key". Each word would be accessible with the granularity just described.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accommodate that situation, the cache fill can take place starting with the "critical cache address" -- the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward -- usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
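Here is a small illustrative sketch in C (using the 128-byte line from the paragraph above; the fill order is just the modulo-N idea described, not any particular controller's behaviour). It shows that a miss always fetches the aligned block containing the requested byte (so, for the earlier 4-word example, accessing word 2 brings in words 0-3, not words 2-5), and how a critical-word-first fill wraps around the line:

    #include <stdio.h>

    #define LINE_SIZE 128u   /* cache line size from the example above */
    #define WORD_SIZE 4u

    int main(void) {
        unsigned addr = 126;                          /* byte the CPU asked for */
        unsigned line_base = addr & ~(LINE_SIZE - 1); /* aligned start of block */

        printf("miss on byte %u fetches the block %u..%u\n",
               addr, line_base, line_base + LINE_SIZE - 1);

        /* Critical-word-first fill: start at the requested word and wrap
           around the line (modulo the line size) until it is all read. */
        unsigned first_word = (addr & ~(WORD_SIZE - 1)) - line_base;
        for (unsigned i = 0; i < LINE_SIZE; i += WORD_SIZE) {
            printf("fill word at line offset %3u\n",
                   (first_word + i) % LINE_SIZE);
        }
        return 0;
    }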
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.
HTH... JoGusto.
Consider a system with a two-level paging scheme in which a regular memory access takes 150 nanoseconds, and servicing a page fault takes 8 ms. An average instruction takes 100 ns of CPU time, and two memory accesses. The TLB hit ratio is 90%, and the page fault rate is one in every 10,000 instructions. What is the effective average instruction execution time?
This was asked in GATE 2004. To solve the question, I would use the following approach:
T(avg memory access) = 0.90(150) + 0.10(150 + 150 + 150) = 180 ns
(150 for page-table level 1, 150 for level 2, and 150 for the memory access itself)
T(effective) = 100 + 2*180 + (1/10000)*8*10^6 = 1260 ns
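A small C sketch of that arithmetic (just the numbers from the problem statement, computed the same way as above):

    #include <stdio.h>

    int main(void) {
        /* Numbers from the problem statement (times in ns). */
        double t_mem      = 150.0;          /* one memory access            */
        double t_fault    = 8e6;            /* page fault service: 8 ms     */
        double t_cpu      = 100.0;          /* CPU time per instruction     */
        double tlb_hit    = 0.90;           /* TLB hit ratio                */
        double fault_rate = 1.0 / 10000.0;  /* page faults per instruction  */

        /* Average memory access: a TLB hit needs 1 access, a miss needs 2
           extra accesses for the two page-table levels. */
        double t_access = tlb_hit * t_mem
                        + (1.0 - tlb_hit) * (t_mem + t_mem + t_mem);

        /* Two memory accesses per instruction, plus amortized fault cost. */
        double t_instr = t_cpu + 2.0 * t_access + fault_rate * t_fault;

        printf("avg memory access: %.0f ns\n", t_access);  /* 180 ns  */
        printf("avg instruction:   %.0f ns\n", t_instr);   /* 1260 ns */
        return 0;
    }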
Is this approach correct? I also have the following doubts:
There won't be a page fault when there is a TLB hit, because the most frequently used pages have to be in memory. Is that correct?
What is the size of the page table for a process? Say, for a 32-bit virtual address, do we allocate a page table with 2^32 entries for every process? How are the memory limits managed in paging?
Please explain these concepts.
I would suggest the following:
Here, 100 ns goes to instruction execution (no difference of opinion there).
Now, the TLB hit ratio is given as 90%, so whenever there is a TLB miss we have to do 2 extra memory accesses to walk the page table, since a 2-level paging scheme is given.
And irrespective of a TLB hit or miss, 2*(150 + 8*10^6 * (1/20000)) has to be added, which is the memory access time for the contents plus the amortized page-fault overhead (one fault per 10,000 instructions, spread over 2 accesses per instruction, gives 1/20,000 per access).
I think your expression assumes that, within an instruction, whenever a TLB hit occurs for the first access it also occurs for the second.
So you assume hit-hit or miss-miss, while since the given TLB hit ratio of 90% is per access and not per instruction, I feel all 4 possible combinations should be considered:
hit-hit, miss-miss, hit-miss, miss-hit.
I am taking a System Architecture course and I have trouble understanding how a direct mapped cache works.
I have looked in several places, and they all explain it in a different manner, which gets me even more confused.
What I cannot understand is what the Tag and the Index are, and how they are selected.
The explanation from my lecture is:
"Address divided is into two parts
index (e.g 15 bits) used to address (32k) RAMs directly
Rest of address, tag is stored and compared with incoming tag. "
Where does that tag come from? It cannot be the full address of the memory location in RAM since it renders direct mapped cache useless (when compared with the fully associative cache).
Thank you very much.
Okay. So let's first understand how the CPU interacts with the cache.
There are three layers of memory (broadly speaking): cache (generally made of SRAM chips), main memory (generally made of DRAM chips), and storage (generally magnetic, like hard disks). Whenever the CPU needs data from some particular location, it first searches the cache to see if it is there. The cache lies closest to the CPU in the memory hierarchy, so its access time is the lowest (and its cost the highest). If the data the CPU is looking for is found there, that constitutes a 'hit', and the data is obtained from the cache for use by the CPU. If it is not there, the data has to be moved from main memory into the cache before the CPU can access it (the CPU generally interacts only with the cache), which incurs a time penalty.
So, to find out whether the data is in the cache or not, various mapping schemes are applied. One is the direct-mapped cache method. For simplicity, let's assume a memory system with 10 cache locations available (numbered 0 to 9) and 40 main memory locations available (numbered 0 to 39).
There are 40 main memory locations available, but only up to 10 of them can be accommodated in the cache at any time. So now, by some means, the incoming request from the CPU needs to be redirected to a cache location. That poses two problems:
How to redirect? Specifically, how to do it in a predictable way which will not change over time?
If the cache location is already filled with some data, the incoming request from the CPU has to determine whether the address it needs is the same as the address whose data is stored in that location.
In our simple example, we can redirect by a simple logic. Given that we have to map 40 main memory locations numbered serially from 0 to 39 to 10 cache locations numbered 0 to 9, the cache location for a memory location n can be n%10. So 21 corresponds to 1, 37 corresponds to 7, etc. That becomes the index.
But 37, 17, and 7 all correspond to 7. So to differentiate between them, we need the tag. Just like the index is n%10, the tag is int(n/10). So now 37, 17, and 7 will have the same index 7, but different tags: 3, 1, and 0. That is, the mapping can be completely specified by two pieces of data: the tag and the index.
So now if a request comes for address location 29, it will translate to a tag of 2 and an index of 9. The index corresponds to the cache location number, so cache location no. 9 will be queried to see whether it contains any data and, if so, whether the associated tag is 2. If yes, it's a cache hit and the data will be fetched from that location immediately. If it is empty, or the tag is not 2, it contains the data corresponding to some other memory address and not 29 (although it has the same index, which means it holds data from an address like 9, 19, 39, etc.). That is a cache miss, and the data from location no. 29 in main memory will have to be loaded into the cache at location 9 (with the tag changed to 2 and any data that was there before discarded), after which it can be fetched by the CPU.
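Here is a small, hypothetical C sketch of that 10-slot example (the stored values are made up; it just mirrors the index = n%10, tag = int(n/10) lookup described above):

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_SLOTS 10

    struct slot { bool valid; int tag; int data; };
    static struct slot cache[NUM_SLOTS];   /* 10 cache locations, all invalid */
    static int ram[40];                    /* 40 main memory locations        */

    static int read_addr(int n) {
        int index = n % NUM_SLOTS;
        int tag   = n / NUM_SLOTS;

        if (cache[index].valid && cache[index].tag == tag) {
            printf("addr %2d: hit  (index %d, tag %d)\n", n, index, tag);
        } else {
            printf("addr %2d: miss (index %d, tag %d), filling from RAM\n",
                   n, index, tag);
            cache[index].valid = true;     /* evict whatever was there */
            cache[index].tag   = tag;
            cache[index].data  = ram[n];
        }
        return cache[index].data;
    }

    int main(void) {
        for (int i = 0; i < 40; i++) ram[i] = i * 100;  /* made-up contents */
        read_addr(29);   /* miss: slot 9 filled, tag 2        */
        read_addr(29);   /* hit                               */
        read_addr(9);    /* miss: same slot 9, tag 0 evicts 2 */
        return 0;
    }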
Let's use an example. A 64-kilobyte cache with 16-byte cache lines has 4096 different cache lines.
You need to break the address down into three different parts.
The lowest bits are used to tell you the byte within a cache line when you get it back; this part isn't directly used in the cache lookup (bits 0-3 in this example).
The next bits are used to INDEX the cache. If you think of the cache as a big column of cache lines, the index bits tell you which row you need to look in for your data. (bits 4-15 in this example)
All the other bits are TAG bits. These bits are stored in the tag store for the data you have stored in the cache, and we compare the corresponding bits of the cache request to what we have stored to figure out if the data we are caching is the data being requested.
The number of bits you use for the index is log_base_2(number_of_cache_lines) [it's really the number of sets, but in a direct mapped cache, there are the same number of lines and sets]
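A quick C sketch of that split for this particular 64 KB / 16-byte-line example (the address is arbitrary):

    #include <stdio.h>

    /* Geometry from the example: 64 KB cache, 16-byte lines -> 4096 lines.
       Offset = bits 0-3, index = bits 4-15, tag = the remaining bits. */
    #define LINE_SIZE   16u
    #define NUM_LINES   4096u
    #define OFFSET_BITS 4u    /* log2(LINE_SIZE) */
    #define INDEX_BITS  12u   /* log2(NUM_LINES) */

    int main(void) {
        unsigned addr = 0xDEADBEEFu;

        unsigned offset = addr & (LINE_SIZE - 1);
        unsigned index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("addr   = 0x%08X\n", addr);
        printf("offset = %u\n", offset);   /* byte within the line      */
        printf("index  = %u\n", index);    /* which row of the cache    */
        printf("tag    = 0x%X\n", tag);    /* stored and compared later */
        return 0;
    }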
A direct-mapped cache is like a table that has rows, also called cache lines, and at least 2 columns: one for the data and one for the tag.
Here is how it works: a read access to the cache takes the middle part of the address, called the index, and uses it as the row number. The data and the tag are looked up at the same time.
Next, the tag is compared with the upper part of the address to decide whether the line is from the same address range in memory and is valid. At the same time, the lower part of the address can be used to select the requested data from the cache line (I assume a cache line can hold data for several words).
I emphasized that the data access and the tag access + compare happen at the same time, because that is key to reducing the latency (the whole purpose of a cache). The data-path RAM access doesn't need to be two steps.
The advantage is that a read is basically a simple table lookup and a compare.
But it is direct mapped, which means that for every read address there is exactly one place in the cache where this data could be cached. So the disadvantage is that many other addresses map to the same place and may compete for this cache line.
I have found a good book at the library that offered me the clear explanation I needed, and I will now share it here in case some other student stumbles across this thread while reading about caches.
The book is "Computer Architecture - A Quantitative Approach" 3rd edition by Hennesy and Patterson, page 390.
First, keep in mind that the main memory is divided into blocks for the cache.
If the cache uses 64-byte blocks and we have 1 GB of RAM, the RAM is divided into 16,777,216 blocks of 64 bytes each (1 GB of RAM / 64 B per block = 2^24 blocks).
From the book:
Where can a block be placed in a cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The destination block is calculated using this formula: <RAM Block Address> MOD <Number of Blocks in the Cache>
So, let's assume we have 32 blocks of RAM and 8 blocks of cache.
If we want to store block 12 from RAM to the cache, RAM block 12 would be stored into Cache block 4. Why? Because 12 / 8 = 1 remainder 4. The remainder is the destination block.
If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed anywhere in a restricted set of places in the cache, the cache is set associative.
Basically, a set is a group of blocks in the cache. A block is first mapped onto a set and then the block can be placed anywhere inside the set.
The formula is: <RAM Block Address> MOD <Number of Sets in the Cache>
So, let's assume we have 32 blocks of RAM and a cache divided into 4 sets (each set having two blocks, meaning 8 blocks in total). This way set 0 would have blocks 0 and 1, set 1 would have blocks 2 and 3, and so on...
If we want to store RAM block 12 in the cache, the RAM block would be stored in cache block 0 or 1. Why? Because 12 / 4 = 3 remainder 0. Therefore set 0 is selected and the block can be placed anywhere inside set 0 (meaning blocks 0 and 1).
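To tie the two formulas together, here is a tiny C sketch (using just the example numbers above) that computes the placement of RAM block 12 for both the direct-mapped and the 2-way set-associative cases:

    #include <stdio.h>

    int main(void) {
        int ram_block  = 12;   /* RAM block we want to cache           */
        int num_blocks = 8;    /* blocks in the cache (direct mapped)  */
        int num_sets   = 4;    /* sets in the cache (2-way set assoc.) */

        /* Direct mapped: <RAM block address> MOD <number of cache blocks>. */
        printf("direct mapped:   RAM block %d -> cache block %d\n",
               ram_block, ram_block % num_blocks);           /* 12 % 8 = 4 */

        /* Set associative: <RAM block address> MOD <number of sets>,
           then the block may go in either block of that set. */
        int set = ram_block % num_sets;                      /* 12 % 4 = 0 */
        printf("2-way set assoc: RAM block %d -> set %d (cache blocks %d or %d)\n",
               ram_block, set, 2 * set, 2 * set + 1);
        return 0;
    }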
Now I'll go back to my original problem with the addresses.
How is a block found if it is in the cache?
Each block frame in the cache has an address. Just to make it clear, a block has both an address and data.
The block address is divided into multiple pieces: Tag, Index and Offset.
The tag is used to identify the block inside the cache, the index shows the set in which the block is situated (so the index bits do not need to be stored or compared as part of the tag), and the offset is used to select the data.
By "select the data" I mean that a cache block will obviously hold more than one memory location; the offset is used to select between them.
So, if you want to imagine a table, these would be the columns:
TAG | INDEX | OFFSET | DATA 1 | DATA 2 | ... | DATA N
Tag would be used to find the block, index would show in which set the block is, offset would select one of the fields to its right.
I hope that my understanding of this is correct, if it is not please let me know.