Cache Memory Blocks Organization - caching

I am not able to understand how exactly the cache is organized in the following scenario.
The cache size is 256 bytes. The cache line size is 8 bytes. All variables are 4 bytes. Assume that an array A[1024] is stored in memory locations 0-4095. Suppose we are using a fully associative mapping technique; how is the array mapped to this particular cache? Consider that the cache is initially empty and that we use the LRU algorithm for replacement. During each replacement, an entire line of the cache is replaced.
Initial analysis :
There will be 32 cache blocks, each 8 bytes long. But the variables to be stored in these locations are only 4 bytes long. I am not able to take this analysis any further as to how these array elements are mapped to the 32 cache blocks.

Let's assume it's accessed sequentially:
    for (int i = 0; i < 1024; ++i)
        read(A[i]);
In that case, you'll fill the first 64 elements (A[0] through A[63]) into the 32 cache blocks in adjacent pairs like MSalters said.
The next access, A[64], misses, and the cache has to pick a victim to kick out. Since you're using LRU and accessing the array in sequential order, the least recently used line is the first block (way 0). You therefore replace A[0] and A[1] with A[64] and A[65], and so on, so in general element i is mapped into way floor(i/2)%32.
Now computing the hit rate requires an additional assumption: each memory line fetched is the size of a full block (8 bytes), since you can't fill half blocks (actually there are ways to do so using mask bits, but let's assume the simple case). We therefore get every second element "for free": fetching A[0] also fetches A[1], and so on. In theory this means that the hit rate could be 50% (miss on even elements, hit on odd ones). In reality most CPUs would perform the accesses in parallel, so you wouldn't really see that hit rate, but let's say the accesses are serialized here.
Note that each new block fetched after the first 64 elements has to evict a block from the cache. If processing the elements also modifies them, the evicted blocks have to be written back too.
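The sequential walk described above can be checked with a small simulation. This is only a sketch under the answer's assumptions (serialized accesses, a whole 8-byte line filled on each miss, empty lines filled lowest way first); the names `simulate_sequential_pass`, `pass_stats`, and `way_of` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES  32u  /* 256-byte cache / 8-byte lines */
#define LINE_BYTES 8u
#define ELEM_BYTES 4u

/* Hit and miss counts for one pass over the array. */
struct pass_stats {
    unsigned hits;
    unsigned misses;
};

/* Simulate reading A[0..n_elems-1] in order through a fully associative
 * LRU cache. If way_of is not NULL, way_of[i] receives the cache line
 * (way) that holds A[i] right after it is accessed. */
struct pass_stats simulate_sequential_pass(unsigned n_elems, unsigned *way_of)
{
    long block_in_line[NUM_LINES];      /* memory block held by each line, -1 = empty */
    unsigned long last_use[NUM_LINES];  /* time of most recent access, 0 = never used */
    struct pass_stats stats = {0, 0};
    unsigned long now = 0;

    for (unsigned w = 0; w < NUM_LINES; ++w) {
        block_in_line[w] = -1;
        last_use[w] = 0;
    }

    for (unsigned i = 0; i < n_elems; ++i) {
        long block = (long)(i * ELEM_BYTES / LINE_BYTES);  /* floor(i/2) */
        unsigned lru = 0;           /* least recently used line seen so far */
        unsigned line = NUM_LINES;  /* NUM_LINES means "not found" */

        for (unsigned w = 0; w < NUM_LINES; ++w) {
            if (block_in_line[w] == block) {
                line = w;
                break;
            }
            if (last_use[w] < last_use[lru])
                lru = w;
        }

        if (line < NUM_LINES) {
            ++stats.hits;
        } else {
            ++stats.misses;
            line = lru;  /* evict the LRU line, or fill a never-used one */
            block_in_line[line] = block;
        }
        last_use[line] = ++now;
        if (way_of != NULL)
            way_of[i] = line;
    }
    assert(stats.hits + stats.misses == n_elems);
    return stats;
}
```

Calling `simulate_sequential_pass(1024, way_of)` gives 512 misses and 512 hits, and `way_of[i]` equals `(i / 2) % 32` throughout, matching the 50% figure and the mapping above.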

Elements A[0] and A[1] are stored in adjacent memory locations, bytes 0-3 and 4-7. That means they share the first cache block. The other elements are similarly mapped pairwise to a cache line. Which pair goes where?

Related

index and tag switching position

Let us consider the alternative {Index, Tag, Offset}. The usage and the size of each field remain the same, e.g. the index is used to locate a block in the cache, and its bit-length is still determined by the number of cache blocks. The only difference is that we now use the MSB bits for the index, the middle portion for the tag, and the last portion for the offset.
What do you think is the shortcoming of this scheme?
This will work, and if the cache is fully associative it won't matter (as there is no index, it's all tag), but if the associativity is limited it will make (far) less effective use of the cache memory.  Why?
Consider an object, that is sufficiently large to cross a cache block boundary.
When accessing the object, the address of some fields vs. the other fields will not be in the same cache block.  How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (most significant bits), the tag will have changed between these two addresses, but the index will be the same; thus, there will be a collision at the index, which will use up one of the ways of the set.  If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated access to the same object.
Let's pretend that we have 12-bit address space, and the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields, and that the object is at location 0x248 so that two integer fields, a, b, are at 0x248 and 0x24c and two other integer fields, c, d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high-order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in a direct-mapped cache both the a&b fields and the c&d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change — it stays at 2.  Thus, on a 1-way set associative cache, accessing fields a or b followed by c or d will evict the a&b fields, which will result in a miss if/when a or b are accessed later.
So, it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as it accessed before.  This happens as we manipulate individual objects, and as we allocate objects that end up being adjacent, and as we repeat accesses to an array (e.g. 2nd loop over an array).
If the index is in the middle, we get more variation in the index as we use different addresses within some block or chunk or area of memory. In our 12-bit address space example, the index changes every 16 bytes, so adjacent 16-byte blocks can be stored in the cache at the same time.
But if the index is at the beginning we need to consume more memory before we get to a different index — the index changes only every 256 bytes, so two adjacent 16-byte blocks will often have collisions.
Our programs and compilers are generally written assuming locality is favored by the cache — and this means that the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.
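To make the 12-bit example concrete, here is a minimal sketch that counts misses in a 16-entry direct-mapped cache under both field orders. The names `layout`, `index_of`, `tag_of`, and `count_misses` are hypothetical helpers for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Order of the three 4-bit fields within a 12-bit address. */
enum layout { TAG_INDEX_OFFSET, INDEX_TAG_OFFSET };

/* Index is the middle hex digit in one layout, the high digit in the other. */
static unsigned index_of(unsigned addr, enum layout l)
{
    return l == TAG_INDEX_OFFSET ? (addr >> 4) & 0xFu : (addr >> 8) & 0xFu;
}

/* Tag is whichever of the two upper hex digits is not the index. */
static unsigned tag_of(unsigned addr, enum layout l)
{
    return l == TAG_INDEX_OFFSET ? (addr >> 8) & 0xFu : (addr >> 4) & 0xFu;
}

/* Count misses for a sequence of addresses, starting from an empty
 * direct-mapped cache with 16 entries of 16 bytes each. */
unsigned count_misses(const unsigned *addrs, int n, enum layout l)
{
    bool valid[16] = {false};
    unsigned tags[16] = {0};
    unsigned misses = 0;

    for (int i = 0; i < n; ++i) {
        assert(addrs[i] < 0x1000u);  /* 12-bit address space */
        unsigned idx = index_of(addrs[i], l);
        unsigned tag = tag_of(addrs[i], l);
        if (!valid[idx] || tags[idx] != tag) {
            ++misses;            /* miss: fill (and possibly evict) this entry */
            valid[idx] = true;
            tags[idx] = tag;
        }
    }
    return misses;
}
```

For the access pattern a, c, b (addresses 0x248, 0x250, 0x24c), `count_misses` returns 2 with `TAG_INDEX_OFFSET` and 3 with `INDEX_TAG_OFFSET`: with the index in the high digit, the access to c evicts the block holding a and b.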

How does Direct mapped cache implement spatial locality?

An array is declared as float A[2048]. Each array element is 4 bytes in size. This program is run on a computer that has a direct-mapped data cache of size 8 KB, with a block (line) size of 16 bytes.
Which elements of the array conflict with element A[0] in the data cache?
Ultimately A[0], A[512], A[1024], A[1536] map to cache block 0.
As per my understanding, when A[0] is required for the first time, A[0], A[1], A[2], A[3] (since one cache block can hold 4 elements) are brought into the cache and placed in cache blocks 0, 1, 2, and 3 respectively.
Other approach would be to bring only A[0] and place it in cache block 0. (Spatial locality not used here)
What is the general practice in such a scenario?
All four elements A[3:0] are stored in cache block 0, since these 4 elements together form 16B. Depending on how the hardware is set up, the next 16B are then stored in cache block 1. The decision of which cache line (a 16B contiguous data granule) maps into which set is made while designing the hardware and is based on certain bits of the address.
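As a rough illustration of "certain bits of the address", the sketch below computes the block index the usual way for a direct-mapped cache: block number modulo the number of blocks. The array base address is not given in the question, so `base` is an assumed parameter, and both function names are hypothetical.

```c
#include <assert.h>

#define CACHE_BYTES 8192u
#define BLOCK_BYTES 16u
#define NUM_BLOCKS  (CACHE_BYTES / BLOCK_BYTES)  /* 512 blocks */
#define NUM_ELEMS   2048u
#define ELEM_BYTES  4u

/* Cache block that byte address addr maps to in the direct-mapped cache. */
unsigned block_index(unsigned long addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_BLOCKS);
}

/* Cache block holding A[i], assuming the array starts at byte address base
 * (an assumption; the question does not state where A is placed). */
unsigned block_index_of_element(unsigned long base, unsigned i)
{
    assert(i < NUM_ELEMS);
    return block_index(base + (unsigned long)ELEM_BYTES * i);
}
```

With `base` set to 0, A[0] through A[3] all give block 0 and A[4] gives block 1. Under this mapping two addresses share a block index only when they are a multiple of 8 KB apart.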

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64-byte cache line, one line can hold up to 16 int elements (arr[0] to arr[15]).
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the load/store unit would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can assemble the bytes in the proper order first (and perform the checks), before letting the load/store unit pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.
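A software analogy of the "rotate and cut" step, purely as a sketch: the hardware does this with a shifter rather than a loop, and the little-endian byte order and the name `extract_from_line` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u

/* Pick `size` bytes (1 to 8) out of a full cache line, starting at the
 * offset given by the low bits of addr, and assemble them as a
 * little-endian value. */
uint64_t extract_from_line(const uint8_t line[LINE_BYTES], uint64_t addr,
                           unsigned size)
{
    unsigned offset = (unsigned)(addr % LINE_BYTES);  /* low 6 address bits */
    assert(size >= 1 && size <= 8);
    assert(offset + size <= LINE_BYTES);  /* access must not cross the line */

    uint64_t value = 0;
    for (unsigned b = 0; b < size; ++b)
        value |= (uint64_t)line[offset + b] << (8 * b);
    return value;
}
```

If arr starts at the beginning of a line, arr[5] sits at byte offset 20, so a 4-byte read selects bytes 20 through 23 of the line.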

Understanding Direct Mapped Cache

I'm trying to understand direct mapped cache, but it is a very complex concept. I have written what I think I understand so far, but I am unsure whether I am correct or not. Can somebody please verify if the explanation below is correct?
E.g., for a made-up computer, just for the sake of this question, there are 1024 memory locations (cells) in the RAM. This equals 2^10, so the address of each of these memory locations must be 10 bits long.
The CPU is asked to get data from the RAM memory address 1100100111. However the CPU doesn't access the data directly from this memory address in the RAM. The RAM stores this data to cache memory and then the CPU gets the data from the cache memory.
There are different ways of doing this, one being direct mapped cache. The cache memory and ram memory are divided up into blocks, where the number of cells in the blocks in each memory must be the same. The number of blocks in the RAM and cache must also be a power of 2.
In this example let's say there are 2^6 = 64 blocks in the RAM, so there are 1024/64 = 16 cells in each block. Let's say there are 2^2 = 4 blocks in the cache, so the cache has 64 cells. The "6" and "2" in the exponents of these numbers are important later on.
Because the number of blocks in the RAM and in the cache is a power of 2, the calculations are easy. In our address 1100100111 the last 6 bits mark the offset 100111 (the 6 comes from the fact that 2^6 = 64), and the remaining 4 bits 1100 mark the RAM block number the data is stored in. Within this block number are two other important numbers. First the cache block number; this is the cache block that the RAM block would be stored to. This is the first 2 bits after the offset, so it will be 00 (the 2 comes from the fact that there are 2^2 = 4 blocks in the cache). The remaining 2 bits in the address mark the tag. This will be 11.
So when the CPU is asked to get data from memory address 1100100111 it will look for this data in cache block number 00. It will compare the tag of the address 11 to the tag saved in the cache, which is a separate piece of memory used to store information about where from the RAM the data has come from. If the tags are the same this is a hit and this is the data the CPU is looking for. If the tag of the address and the tag in the memory are different, then this is a miss, and the data isn't stored in the cache.
If this is the case, the cache controller will get the data from block number 1100 in the RAM and store it in the cache block number 00, and update the tag in this block to 11. The CPU can now get the data in this block.
Is this all correct? I need to understand this before I can start to try and understand associative and set associative memory.
Thanks!
You have the right idea, but your numbers went wrong somewhere. In your example you have a direct-mapped cache of 4 blocks/lines of 16 bytes/cells each. The address 1100100111 will be divided up as follows. You use the least significant four bits 0111 as the offset because it refers to which cell of a particular block you want. I think you accidentally included the block number as part of the offset. Anyway, the next least significant two bits 10 will be the block number and the most significant four bits 1100 will be the tag.
Your understanding seems to be fine. One thing more that is necessary is a bit to indicate if the cache block is valid or not. Good luck with the associative stuff!
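The corrected split (4 tag bits, 2 block-number bits, 4 offset bits) together with the valid bit can be sketched as follows. The struct and function names are invented for illustration, and the cache stores only tags here, not the data itself.

```c
#include <assert.h>
#include <stdbool.h>

#define OFFSET_BITS 4u  /* 16 cells per block */
#define INDEX_BITS  2u  /* 4 cache blocks */
#define NUM_BLOCKS  (1u << INDEX_BITS)

struct addr_fields {
    unsigned tag;
    unsigned index;
    unsigned offset;
};

/* Split a 10-bit address into tag, cache block number and offset. */
struct addr_fields split_address(unsigned addr)
{
    struct addr_fields f;
    assert(addr < 1024u);  /* 10-bit address space */
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}

struct cache_block {
    bool valid;    /* false until the block has been filled once */
    unsigned tag;
};

/* Look up addr in a direct-mapped cache of NUM_BLOCKS blocks.
 * Returns true on a hit; on a miss the block is (re)filled. */
bool access_cache(struct cache_block cache[NUM_BLOCKS], unsigned addr)
{
    struct addr_fields f = split_address(addr);
    struct cache_block *blk = &cache[f.index];

    if (blk->valid && blk->tag == f.tag)
        return true;
    blk->valid = true;  /* fetch the block from RAM and record its tag */
    blk->tag = f.tag;
    return false;
}
```

For the address 1100100111 (0x327), `split_address` yields tag 1100, block number 10 and offset 0111. The first access misses and fills block 2; a second access to any address in the same RAM block then hits.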

Understanding caches and block sizes

A quick question to make sure I understand the concept behind a "block" and its usage with caches.
Suppose I have a small cache that holds 4 blocks of 4 words each. Let's say it's also direct mapped. If I try to access a word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache, or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is triggered, does the CPU load one block's worth of data (4 words) starting at the accessed value in memory, or does it calculate which block that word in memory is in and bring in that block instead?
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, i.e., accessed by using some portion of the requested address (a "lookup table key", if you will). If the cache uses a block size of 1 word, the entire address -- all N bits of it -- would be the "key". Each word would be accessible with the granularity just described.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accommodate that situation, the cache fill can take place starting with the "critical cache address" -- the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward -- usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.
HTH... JoGusto.
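For the 4-word blocks in the question, the alignment rule and a wrap-around critical-word-first fill can be sketched like this. The function names are hypothetical, and the wrap-around order is just one possible fill order, not a description of any particular CPU.

```c
#include <assert.h>

#define WORDS_PER_BLOCK 4u

/* First word address of the aligned block that contains word_addr. */
unsigned block_base(unsigned word_addr)
{
    /* The mask trick only works for power-of-two block sizes. */
    assert((WORDS_PER_BLOCK & (WORDS_PER_BLOCK - 1u)) == 0);
    return word_addr & ~(WORDS_PER_BLOCK - 1u);
}

/* Fill order[] with the word addresses of the block in critical-word-first
 * order: the requested word first, then the rest, wrapping modulo the
 * block size. */
void critical_word_first(unsigned word_addr, unsigned order[WORDS_PER_BLOCK])
{
    unsigned base = block_base(word_addr);
    unsigned start = word_addr - base;

    for (unsigned k = 0; k < WORDS_PER_BLOCK; ++k)
        order[k] = base + (start + k) % WORDS_PER_BLOCK;
}
```

An access to word 2 therefore brings in the aligned block of words 0-3, not words 2-5, and with this fill order the words arrive as 2, 3, 0, 1.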

Resources