Direct Mapped Cache using Blocks - caching

I know this is a homework question and I am not asking for the answer. I would just like to understand this question, feel free to use other examples to explain.
The question that I need to answer is...
Each reference is a read of a 4-byte integer value and is described by the byte
address of that integer.
Assuming a 1KB, 16B block, direct-mapped cache, initially empty, fill in whether
each reference is a hit or a miss.
We are given a list of references that are 4 bytes. For example 0x00000000, 0x00000006, ...
From my understanding, there are 64 blocks (1024/16) and each block is 16 bytes. When it looks at the first reference, it would be a miss and it would bring that into the cache. I know that it brings the next reference into the cache because each block would hold 16 bytes. Does this mean, on a miss, it brings in 4 references because each reference is 4 bytes?

Yes, what you have understood is correct. When a byte is read from memory, spatial locality suggests that the next few bytes will also be read soon. So caches usually have a block size of more than one reference, in this case 4 references. During the next memory access, if the processor requests the very next reference, it's already there in the cache!
The memory address can be divided into two parts: the block address and the block offset. The block offset is used to choose between the references that live in the same cache block. The block address is further divided into tag and index fields. The index field is used to choose which set to access (in a direct-mapped cache, each cache block is its own set), and the tag field is compared against the tag stored with that block to tell whether it actually holds the requested address.
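To make that field split concrete, here is a minimal sketch in C of how you might classify such references against a 1KB, 16-byte-block direct-mapped cache. The helper name and the example addresses are made up for illustration; with 16-byte blocks the offset is the low 4 bits, the index the next 6 bits (64 blocks), and the tag the remaining high bits.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE 1024                          /* 1 KB cache     */
#define BLOCK_SIZE 16                            /* 16-byte blocks */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)     /* 64 blocks      */

static int      valid[NUM_BLOCKS];               /* one valid bit per block  */
static uint32_t tags[NUM_BLOCKS];                /* one stored tag per block */

/* classify one reference, filling the cache on a miss */
static const char *access_cache(uint32_t addr)
{
    uint32_t index = (addr / BLOCK_SIZE) % NUM_BLOCKS;  /* bits 4..9      */
    uint32_t tag   = addr / (BLOCK_SIZE * NUM_BLOCKS);  /* bits 10 and up */
    /* the offset (addr % BLOCK_SIZE) picks the word inside the block and
       does not affect hit/miss */

    if (valid[index] && tags[index] == tag)
        return "hit";
    valid[index] = 1;                            /* bring the whole 16-byte block in */
    tags[index]  = tag;
    return "miss";
}

int main(void)
{
    /* made-up 4-byte reads; substitute the reference list from the assignment */
    uint32_t refs[] = { 0x00000000, 0x00000004, 0x00000008, 0x00000010 };

    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        printf("0x%08x -> %s\n", (unsigned)refs[i], access_cache(refs[i]));
    return 0;
}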

#shailesh is right, but be careful with the word reference. The reference pattern depends on the program. Imagine the case where we write a C program that references a char array with a 16-byte stride. Here's a dumb routine that will do basically that:
void foo(char *x, int MAX) {
    int i;
    char a;
    /* touch one byte every 16 bytes: each iteration lands in a new 16B block */
    for (i = 0; i < MAX; i += 16)
        a = x[i];
    (void)a;    /* keep the compiler from warning about the unused read */
}
Suppose x is at address 0x00000000. This loop will then reference addresses
0x00000000, 0x00000010, 0x00000020, 0x00000030, and so on. In this case, after the very first reference to x[0], x[0] through x[15] will have been brought into the cache because of the 16B block size. But the next reference, x[16], has not been. In other words, for your cache here, every reference in this loop will result in a cache miss.
You will find that when optimizing for performance, thinking about the machine's cache organization and behavior will help you avoid poor memory access patterns like this.
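To see how badly that stride interacts with this cache, a tiny sketch (assuming, purely for illustration, that x starts at address 0 and the block size is 16 bytes) prints which block each reference falls into; every iteration lands in a block that has not been touched before, hence all misses:

#include <stdio.h>

#define BLOCK_SIZE 16

int main(void)
{
    /* the addresses referenced by the loop above, for the first few iterations */
    for (unsigned addr = 0; addr < 8 * BLOCK_SIZE; addr += 16)
        printf("reference at 0x%08x -> block %u\n", addr, addr / BLOCK_SIZE);
    return 0;
}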

Related

How does Linux buddy allocator determine which size order list to insert into when freeing block?

The Binary Buddy Allocator used in Linux uses bitmaps where each bit corresponds to a state of a pair of buddy-blocks (taken from this article). And the void free_page(void *addr) call doesn't take a size of the allocated block that is to be freed.
I don't understand how this algorithm determines into which order list to insert a block being freed without knowing its size. Let's imagine we allocated all available memory, so all the bitmaps are zeroed. Now we call free_page() with some address, and it's not clear into which list the block should be inserted.
Of course free_page(void *addr) or the equivalent __free_page(struct page *page) do not need a second argument: their purpose is to free one page only, and the size of a single page is known. The order is always 0 in this case.
Freeing a block of pages is done through __free_pages(struct page *page, unsigned int order), which does indeed take order as second argument.
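As a rough kernel-context sketch (not a complete module; error handling and context omitted), the order travels with the caller from allocation to free, so the buddy allocator never has to infer the block size from the address alone:

#include <linux/gfp.h>
#include <linux/mm.h>

static void buddy_order_example(void)
{
    unsigned int order = 2;                       /* 2^2 = 4 contiguous pages */
    struct page *pages = alloc_pages(GFP_KERNEL, order);

    if (!pages)
        return;

    /* ... use the pages ... */

    __free_pages(pages, order);                   /* caller supplies the order again */
}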

Cache calculating block offset and index

I've read several topics about this theme but I could not get the answer. So my question is:
1) How is the block offset calculated?
I want to know not the formula but the concept behind it. As I understand it, it is the number of places within a block where a value can be stored. For example, if a block has 8 bytes of storage and has to store 2-byte values, is its block offset 2 bits (so there are 4 places in the block to store a value)?
The number of block-offset bits is simply log2(cache_line_size).
The reason is that every system I know of is byte-addressable, so you need enough bits to index any byte in the block. Although most systems have a word size larger than a single byte, they still support offsets at single-byte granularity, even if that is not the common case.
So for the example you mentioned of an 8-byte block size with 2-byte words, you would still need 3 bits in order to allow accessing any byte. If you had a system that was not byte-addressable, then you could use just 2 bits for the block offset. But in practice all systems that I know of are byte-addressable.
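As a quick check of that arithmetic, here is a small sketch in C (the block sizes and the example address are made up):

#include <stdio.h>

/* number of offset bits = log2(block_size), assuming a power-of-two block size */
static unsigned offset_bits(unsigned block_size)
{
    unsigned bits = 0;
    while ((1u << bits) < block_size)
        bits++;
    return bits;
}

int main(void)
{
    printf("8-byte block  -> %u offset bits\n", offset_bits(8));   /* 3 */
    printf("16-byte block -> %u offset bits\n", offset_bits(16));  /* 4 */
    printf("64-byte block -> %u offset bits\n", offset_bits(64));  /* 6 */

    /* extracting the byte offset of an address within an 8-byte block */
    unsigned addr = 0x1234567u;
    printf("offset of 0x%x within its 8-byte block: %u\n", addr, addr & (8u - 1u));
    return 0;
}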

Calculating Cache Memory Hit and Miss, and Calculating Rows in Cache

I am studying an old exam for an upcoming exam, and the final questions consist of what the title describes. Now, I am familiar with assembly language instructions and I somewhat know what the code means. But, what the exam question actually wants me to do is confusing. I would really appreciate if someone could explain this question.
The question:
I am given a cache-memory which has room for 512 bytes and every row is 8 bytes long. The memory is direct-mapped and an "address" is 32 bits long. Also, the cache-memory is empty from the start.
After that, I get some instructions and am supposed to explain if it becomes a cache-hit or cache-miss. It should also be assumed that the instructions are all sequential and all data that is added/modified in an instruction still exists for the next instruction.
The instructions I get are
movia r8, 0xBEDA12C4
ldw r10, 0( r8 )
ldw r11, 8( r8 )
stw r10, 16( r8 )
ldw r10, 24(r8)
ldw r18, 32(r8)
Now I would really appreciate if someone could explain the details to me:
The cache-memory has room for a total of 512 bytes. What is this? Is it the total memory the cache is able to store? Also, I heard from somewhere that this is how you calculate rows in cache. For example, 512 bytes of memory and every row is 16 bytes. 512/16 = 32 rows in cache. For this example 512/8 = 64 rows. Which one is it? What does this mean!?
It also states that every row is 8 bytes long. I've seen the example with TAG, ROW, BYTE where they try to illustrate the cache. But how do I understand the 8 bytes per row? At least it doesn't seem to show up in the lengths of TAG, ROW, BYTE. What is this for?
Direct-mapped cache. I understand this somewhat. It's just a big row of slots, in order, which are either empty or not, yeah? I found some information on this here.
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
*Updated link: https://web.archive.org/web/20150213025748/http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
Now to the main part. How do I calculate for each instruction if it will be a cache miss or hit? My guess is that the first instruction ought to be a miss, since the question said that the cache memory is empty from the start. The second instruction also must be a cache miss but from this point on I am not sure how to calculate if the instruction generates a cache hit or miss. To be honest, I am not even sure what a hit would be.
I would really appreciate if someone could show me how to calculate each step and how I know whether an instruction creates a cache hit or miss. The instructions we get for calculating this are really confusing. Thank you so much!
Generally you have to look at the cache as a separate memory space with only 512 bytes, addressable, readable and writable as rows of 8 bytes each. If you need byte 2, the row address will be 0; you read the whole row and select byte 2 from it. If you need byte 8, the row address will be 1, and you select byte 0 from that row. Such a small memory has one huge advantage: it is fast. On its own, it can only hold the contents of a small part of some larger memory space, say the first 512 bytes. If you store something to address 1 of the larger memory space, it would go to that smaller memory instead; internally it becomes row 0, offset 1 of the small memory. If you access beyond that, for example address 1000, you would have to wait longer. Used like this it would just be a set of memory-mapped "registers", which would actually be faster and better in some cases than a cache; unfortunately, processor makers generally won't let you use the cache in that way (probably for marketing and support reasons, to sell other products as a separate market segment at a higher price).
If you add some extra space to each row to store another value, you can keep part of the main-memory address there; that second part is called the tag. Now if you have some address like 0xfffff000, you look up the tag storage for the row selected by bits 3..8 of the address (bits 0..2 give the offset within the 8-byte row), simply by masking out all the other bits. One bit of that tag storage can indicate whether anything is stored in the row at all; the other bits hold the upper part of the main-memory address. If you want to cache something there, you set the bit indicating that the row is not "empty", store the upper bits of the main-memory address as the tag, and copy the 8 bytes from main memory into the row. Next time, before reading something in that address range, you read the tag of that row first, then decide whether to read from slow main memory or from the smaller but faster memory (which would be a cache hit).
If you then write something at an address that is a multiple of 512 bytes away in main memory (so it maps to the same row), you would have to write the current 8-byte row back to main memory, put what you want into that very same row, and update the tag with the new upper address bits. You would lose the copy of your previous data in the smaller (but faster) memory; if you need the previous value again (any of those 8 bytes), you would have to copy it from main memory once more (a cache miss).
The same goes for all the other rows of that "cache" memory. So every access becomes a sequence of checking the cache, then reading, writing, or copying data to or from main memory as needed.
That is called 1-way associativity (direct mapping). For 2 ways there would be one more identical 512-byte array, which can hold a different address that maps to the same row (addresses a multiple of 512 bytes apart in main memory). The tags of those 2 arrays can be checked simultaneously, and if either array has a copy of that memory range it can return it instead of reading it from main memory. Without the tag checking (which costs extra cycles), the "cache" is essentially just a small amount of fast memory.
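If you want to check the instruction sequence mechanically, here is a rough sketch (assuming only the data accesses go through this cache, not the instruction fetches) that splits each data address from the question into tag, row and byte fields for a 512-byte cache with 8-byte rows; from there you can apply the hit/miss bookkeeping described above:

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE 512
#define ROW_SIZE   8
#define NUM_ROWS   (CACHE_SIZE / ROW_SIZE)        /* 64 rows */

int main(void)
{
    /* data addresses touched by the ldw/stw instructions: r8 + {0, 8, 16, 24, 32} */
    uint32_t base = 0xBEDA12C4u;
    unsigned offs[] = { 0, 8, 16, 24, 32 };

    for (unsigned i = 0; i < sizeof offs / sizeof offs[0]; i++) {
        uint32_t addr = base + offs[i];
        uint32_t byte = addr % ROW_SIZE;              /* low 3 bits     */
        uint32_t row  = (addr / ROW_SIZE) % NUM_ROWS; /* next 6 bits    */
        uint32_t tag  = addr / (ROW_SIZE * NUM_ROWS); /* remaining bits */
        printf("0x%08x -> tag 0x%06x, row %2u, byte %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)row, (unsigned)byte);
    }
    return 0;
}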

Understanding Direct Mapped Cache

I'm trying to understand direct mapped cache, but it is a very complex concept. I have written what I think I understand so far, but I am unsure whether I am correct or not. Can somebody please verify if the explanation below is correct?
E.g., for a made-up computer, just for the sake of this question, there are 1024 memory locations (cells) in the RAM. This equals 2^10, so the address of each of these memory locations must be 10 bits long.
The CPU is asked to get data from the RAM memory address 1100100111. However the CPU doesn't access the data directly from this memory address in the RAM. The RAM stores this data to cache memory and then the CPU gets the data from the cache memory.
There are different ways of doing this, one being direct-mapped cache. The cache memory and RAM are divided up into blocks, where the blocks in both memories must contain the same number of cells. The number of blocks in the RAM and cache must also be a power of 2.
In this example let's say there are 2^6 = 64 blocks in the RAM, so there are 1024/64 = 16 cells in each block. Let's say there are 2^2 = 4 blocks in the cache, so the cache has 64 cells. The "6" and "2" in the exponents of these numbers are important later on.
Because the number of blocks in the RAM and cache is a power of 2, it makes the calculations easy. In our address 1100100111 the last 6 bits mark the offset 100111 (the 6 comes from the fact that 2^6 = 64), and the remaining 4 bits 1100 mark the RAM block number the data is stored in. Within this block number are two other important numbers. First the cache block number; this is the cache block that that RAM block would be stored in. This is the first 2 bits after the offset, so it will be 00 (the 2 comes from the fact that there are 2^2 = 4 blocks in the cache). The remaining 2 bits in the address mark the tag. This will be 11.
So when the CPU is asked to get data from memory address 1100100111 it will look for this data in cache block number 00. It will compare the tag of the address, 11, to the tag saved in the cache, which is a separate piece of memory used to store information about where in the RAM the data came from. If the tags are the same this is a hit, and this is the data the CPU is looking for. If the tag of the address and the tag in the cache are different, then this is a miss, and the data isn't stored in the cache.
If this is the case, the cache controller will get the data from block number 1100 in the RAM and store it in the cache block number 00, and update the tag in this block to 11. The CPU can now get the data in this block.
Is this all correct? I need to understand this before I can start to try and understand associative and set associative memory.
Thanks!
You have the right idea, but your numbers went wrong somewhere. In your example you have a direct-mapped cache of 4 blocks/lines of 16 bytes/cells each. The address 1100100111 will be divided up as follows. You use the least significant four bits 0111 as the offset because it refers to which cell of a particular block you want. I think you accidentally included the block number as part of the offset. Anyway, the next least significant two bits 10 will be the block number and the most significant four bits 1100 will be the tag.
Your understanding seems to be fine. One more thing that is necessary is a valid bit to indicate whether the cache block actually holds data. Good luck with the associative stuff!
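To make the corrected split explicit (4 offset bits for 16 cells per block, 2 index bits for 4 cache blocks, 4 tag bits), a tiny sketch:

#include <stdio.h>

int main(void)
{
    unsigned addr = 0x327;                 /* 1100100111 in binary */

    unsigned offset = addr & 0xF;          /* low 4 bits : 0111 */
    unsigned index  = (addr >> 4) & 0x3;   /* next 2 bits: 10   */
    unsigned tag    = addr >> 6;           /* high 4 bits: 1100 */

    printf("offset = %u, index = %u, tag = %u\n", offset, index, tag);
    return 0;
}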

are array initialization operations cached as well

If you are not reading a value but assigning one, for example:
int[] array = new int[5];
for (int i = 0; i < array.length; i++) {   // arrays use .length, not .length()
    array[i] = 2;
}
Does the array still come into the cache? Can't the CPU bring the array elements one by one into its registers, do the assignment, and then write the updated values to main memory, bypassing the cache because it's not necessary in this case?
The answer depends on the cache's write policy; I answered assuming write-back with write-allocate.
The array will still come into the cache, and it will make a difference. When a cache block is retrieved from memory, it covers more than just a single memory location (the actual size depends on the design of the cache). Since arrays are stored contiguously in memory, pulling in array[0] will pull in the rest of the block, which will include (at least some of) array[1], array[2], array[3] and array[4]. This means the following accesses will not have to go to main memory.
Also, after all this is done, the values will NOT be written to memory immediately (under write-back); instead the CPU will keep using the cache for reads and writes until that cache block is evicted, at which point the values will be written back to main memory.
Overall this is preferable to going to memory every time, because the cache is much faster and the chances are the program is going to use the memory it just wrote again relatively soon.
If the policy is write-through with no-allocate, then a write miss won't bring the block into the cache, and the write goes straight through to main memory.
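For what it's worth, here is the same loop rendered as a small C program, annotated with what a write-back, write-allocate cache would do. The 16-byte block size and the alignment of the array are assumptions for the sake of the example, not something the question specifies:

#include <stdio.h>

int main(void)
{
    int array[5];

    /* Under write-back + write-allocate with (assumed) 16-byte blocks, the first
       store to each block misses, allocates the block in the cache, and the
       following stores to the same block hit.  If the array happens to start on
       a 16-byte boundary, array[0]..array[3] share one block (1 miss, 3 hits)
       and array[4] starts another block (1 more miss). */
    for (int i = 0; i < 5; i++)
        array[i] = 2;

    /* Under write-back, this read is served from the cache; nothing has been
       written to main memory yet unless the block was evicted in between. */
    printf("%d\n", array[0]);
    return 0;
}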
