Calculating Cache Memory Hit and Miss, and Calculating Rows in Cache - caching

I am studying an old exam in preparation for an upcoming one, and the final questions are about what the title describes. Now, I am familiar with assembly language instructions and I somewhat know what the code means. But what the exam question actually wants me to do is confusing. I would really appreciate it if someone could explain this question.
The question:
I am given a cache-memory which has room for 512 bytes and every row is 8 bytes long. The memory is direct-mapped and an "address" is 32 bits long. Also, the cache-memory is empty from the start.
After that, I get some instructions and am supposed to explain whether each one results in a cache hit or a cache miss. It should also be assumed that the instructions are all sequential and that all data that is added/modified in an instruction still exists for the next instruction.
The instructions I get are
movia r8, 0xBEDA12C4
ldw r10, 0( r8 )
ldw r11, 8( r8 )
stw r10, 16( r8 )
ldw r10, 24(r8)
ldw r18, 32(r8)
Now I would really appreciate if someone could explain the details to me:
The cache-memory has room for a total of 512 bytes. What is this? Is it the total memory the cache is able to store? Also, I heard from somewhere that this is how you calculate rows in cache. For example, 512 bytes of memory and every row is 16 bytes. 512/16 = 32 rows in cache. For this example 512/8 = 64 rows. Which one is it? What does this mean!?
It also states that every row is 8 bytes long. I've seen the example with TAG, ROW, BYTE where they try to illustrate the cache. But how do I understand the 8 bytes per row? At least it doesn't seem to be counted in the lengths of TAG, ROW, BYTE. What is this for?
Direct-mapped cache. I understand this somewhat. It's just a big row of slots, in order, which are either empty or not, yeah? I found some information on this here.
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
*Updated link: https://web.archive.org/web/20150213025748/http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
Now to the main part. How do I calculate for each instruction if it will be a cache miss or hit? My guess is that the first instruction ought to be a miss, since the question said that the cache memory is empty from the start. The second instruction also must be a cache miss but from this point on I am not sure how to calculate if the instruction generates a cache hit or miss. To be honest, I am not even sure what a hit would be.
I would really appreciate if someone could show me how to calculate each step and how I know whether an instruction creates a cache hit or miss. The instructions we get for calculating this are really confusing. Thank you so much!

Generally you have to look at it as a separate memory space of only 512 bytes, addressable, readable and writable as arrays of 8 bytes each. If you need byte 2, the array address will be 0; you read the whole array and select byte 2 from it. If you need byte 8, the array address will be 1, and you select byte 0 from that array. Such a small memory has one huge advantage - it is fast. On its own it can store the contents of some larger memory space, but only the first 512 bytes of it. If you store something to address 1 of the larger memory space, it will go to that smaller memory instead; internally, for that small amount of memory, the address becomes array 0, offset 1. If you access beyond that, for example address 1000, you will have to wait longer. Used that way it would just be memory-mapped "registers" - actually faster and better in some cases than a "cache" - but unfortunately, for some reason, processor makers generally won't let you use the cache in that way (probably marketing and support reasons - to sell other products as a separate market segment, with a higher price).
If you add some extra space to each array to store another value, you can store a part of the address there. Without hardware support you could store virtually anything there; that second part is called the tag. Now if you have some address fffff000, you can read that second space (assuming you have the commands to do so) starting from array 0 - for simplicity and speed you obtain the array address from the main-memory address by masking off all bits except bits 3..8 (bits 0..2 are used to obtain the offset within the 8-byte array) - and check the tag stored for that array. One bit in the tag may be used to indicate whether anything is stored there at all; the other bits may store the upper part of the main-memory address. If you want to cache something there, you set the bit indicating that the array is not "empty", store the upper bits of the main-memory address, and copy the 8 bytes from main memory. The next time, before reading something within that range of memory, you read the tag part of the smaller memory array first, then decide whether to read from slow main memory or from that smaller but faster part (that would be a cache hit).
If you write to another main-memory address that maps to the same array (an address a multiple of 512 bytes away), you would have to copy the already-cached 8-byte array back to main memory, all 8 bytes, write what you want into that very same cell, and then update the tag with the new address. But you would lose the previous copy of your data in the smaller (but faster) memory area. If you need the previous value again (any of those 8 bytes), you would have to copy it from main memory again (a cache miss).
The same goes for all the other arrays of that "cache" memory. So we have a sequence of checking the cache, then writing, reading and copying data to or from main memory.
That is called 1-way associativity; for 2 ways there would be one more identical array of 512 bytes, which can hold different addresses (again with a step of 512 from main memory). The tags of those 2 arrays can be checked simultaneously, and if either array has a copy of that memory range it can return it instead of reading it from main memory. Without the tag checking (which costs extra cycles), the "cache" is essentially just a small amount of fast memory.
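
To make the hit/miss calculation from the question concrete, here is a minimal Python sketch (not part of the exam) of that cache: 512 bytes with 8-byte rows gives 64 rows, so bits 0-2 of an address are the byte offset, bits 3-8 are the row index, and the rest is the tag. It treats the stw as allocating the row on a miss (write-allocate), which is the usual assumption for this kind of exercise; it is a miss either way.

# Direct-mapped cache sketch for the exam question:
# 512-byte cache, 8-byte rows -> 64 rows.
# Offset = bits 0-2, row index = bits 3-8, tag = remaining upper bits.

ROW_SIZE = 8                     # bytes per row
NUM_ROWS = 512 // ROW_SIZE       # 64 rows

cache = {}                       # row index -> tag of the line currently stored there

def access(addr):
    """Return 'hit' or 'miss' for a byte address, updating the cache."""
    index = (addr // ROW_SIZE) % NUM_ROWS
    tag = addr // (ROW_SIZE * NUM_ROWS)
    if cache.get(index) == tag:
        return "hit"
    cache[index] = tag           # load (or allocate on write) the row
    return "miss"

base = 0xBEDA12C4                # value placed in r8 by movia
for offset in (0, 8, 16, 24, 32):    # the ldw/stw offsets in the question
    addr = base + offset
    print(f"access 0x{addr:08X}: index {(addr // ROW_SIZE) % NUM_ROWS}, {access(addr)}")

Because 0xBEDA12C4 sits at offset 4 within its 8-byte row, the five accesses fall into five different rows (indexes 24 through 28), so with a cold cache the sketch reports a miss for every one of them. With 16-byte rows, by contrast, some of the accesses would have shared a row and hit.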

Related

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you write the contents of a full cache line to a previously uncached address, and then soon after read from that address again, can the CPU avoid having to read the old contents of that address from memory?
Effectively it shouldn't matter what the contents of the memory were previously, because the full cache line's worth of data was completely overwritten. I can understand that if it were a partial cache-line write to an uncached address, followed by a read, it would incur the overhead of having to synchronise with main memory, etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time while an RFO request pulls that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10 and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well covered, with pretty pictures, on stuffedcow, and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.
A write followed by a smaller read from the same address has little or no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.
The last case, where the read is bigger than the write, is definitely a case where the store forwarding stalls. The quoted 11 cycles probably applies to the case where all of the involved bytes are in L1 - but in the case where some bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.
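
If you want to play with the containment rule described above, here is a toy Python check (just byte-range arithmetic, not a model of any real store buffer): per the answer, forwarding can normally succeed only when the load is fully covered by the most recent store.

# Toy check of the "load must be fully contained in the store" rule.
# Addresses and sizes are in bytes.

def load_fully_contained(store_addr, store_size, load_addr, load_size):
    return (load_addr >= store_addr and
            load_addr + load_size <= store_addr + store_size)

# 4-byte store to address 10, 4-byte load from address 9:
# only bytes 10-12 come from the store, byte 9 does not -> no forwarding.
print(load_fully_contained(10, 4, 9, 4))    # False

# 4-byte store to address 10, 2-byte load from address 12:
# fully contained, so modern CPUs can usually forward it.
print(load_fully_contained(10, 4, 12, 2))   # True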

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64-byte cache line, a line could hold 16 array elements, from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the CPU pick an offset into a cache line and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.
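
To picture the "provide the full line, then rotate and cut" step, here is a rough Python sketch (the line contents and the slicing helper are made up for illustration): the low 6 bits of the address select which bytes of the 64-byte line end up in the register.

# Picking arr[5] (a 4-byte int) out of a 64-byte cache line.
# The line contents here are invented; only the index arithmetic matters.

LINE_SIZE = 64
line = bytes(range(LINE_SIZE))      # stand-in for the 16 ints arr[0]..arr[15]

def load_from_line(line, byte_offset, size):
    """What the 'rotate and cut' stage effectively does: slice the line."""
    return line[byte_offset:byte_offset + size]

element = 5
offset_in_line = (element * 4) % LINE_SIZE   # low 6 bits of the byte address
value_bytes = load_from_line(line, offset_in_line, 4)
print(offset_in_line, value_bytes.hex())     # offset 20, bytes 20..23 of the line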

Understanding Direct Mapped Cache

I'm trying to understand direct mapped cache, but it is a very complex concept. I have written what I think I understand so far, but I am unsure whether I am correct or not. Can somebody please verify if the explanation below is correct?
E.g., for a made-up computer, just for the sake of this question, there are 1024 memory locations (cells) in the RAM. This equals 2^10, so the address for each of these memory locations must be 10 bits long.
The CPU is asked to get data from the RAM memory address 1100100111. However the CPU doesn't access the data directly from this memory address in the RAM. The RAM stores this data to cache memory and then the CPU gets the data from the cache memory.
There are different ways of doing this, one being direct mapped cache. The cache memory and ram memory are divided up into blocks, where the number of cells in the blocks in each memory must be the same. The number of blocks in the RAM and cache must also be a power of 2.
In this example lets say there are 2^6 = 64 blocks in the RAM, so there are 1024/64 = 16 cells in each block. Lets say there are 2^2 = 4 blocks in the cache, so the cache has 64 cells. The "6" and "2" in the exponents of these numbers are important later on.
Because the number of blocks in the RAM and cache is a power of 2, it makes the calculations easy. In our address 1100100111 the last 6 bits mark the offset, 100111 (the 6 comes from the fact that 2^6 = 64), and the remaining 4 bits, 1100, mark the RAM block number the data is stored in. Within this block number are two other important numbers. First the cache block number; this is the cache block that that RAM block would be stored in. This is the first 2 bits after the offset, so it will be 00 (the 2 comes from the fact that there are 2^2 = 4 blocks in the cache). The remaining 2 bits in the address mark the tag. This will be 11.
So when the CPU is asked to get data from memory address 1100100111 it will look for this data in cache block number 00. It will compare the tag of the address, 11, to the tag saved in the cache, which is a separate piece of memory used to store information about where in the RAM the data came from. If the tags are the same, this is a hit, and this is the data the CPU is looking for. If the tag of the address and the tag in the memory are different, then this is a miss, and the data isn't stored in the cache.
If this is the case, the cache controller will get the data from block number 1100 in the RAM and store it in the cache block number 00, and update the tag in this block to 11. The CPU can now get the data in this block.
Is this all correct? I need to understand this before I can start to try and understand associative and set associative memory.
Thanks!
You have the right idea, but your numbers went wrong somewhere. In your example you have a direct-mapped cache of 4 blocks/lines of 16 bytes/cells each. The address 1100100111 will be divided up as follows. You use the least significant four bits 0111 as the offset because it refers to which cell of a particular block you want. I think you accidentally included the block number as part of the offset. Anyway, the next least significant two bits 10 will be the block number and the most significant four bits 1100 will be the tag.
Your understanding seems to be fine. One thing more that is necessary is a bit to indicate if the cache block is valid or not. Good luck with the associative stuff!
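
Here is a small check (Python, not part of the original question) of the corrected split from the answer above: a 4-bit offset (16 cells per block), a 2-bit block number (4 cache blocks), and a 4-bit tag.

# Split the 10-bit address 1100100111 into tag / cache block / offset
# for the made-up machine above: 16 cells per block, 4 cache blocks.

addr = 0b1100100111

OFFSET_BITS = 4    # 16 cells per block -> 4 offset bits
INDEX_BITS  = 2    # 4 cache blocks     -> 2 index bits

offset = addr & ((1 << OFFSET_BITS) - 1)
index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
tag    = addr >> (OFFSET_BITS + INDEX_BITS)

print(f"offset = {offset:04b}, block = {index:02b}, tag = {tag:04b}")
# offset = 0111, block = 10, tag = 1100, matching the answer above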

Why is alignment important?

I know that some processors fail with misaligned data, and others like the oh-so-common x86, would just be slower with that.
My question is why? Why is it harder for an x86 processor to get the data from the pointer 0x12345679 than it is from the pointer 0x12345678? Just to be clear, I'm aware that page faults may happen if the data is in multiple pages, and I understand that more data may need to be fetched from memory (one part for the start of the value and one for the end), but that isn't always true and this isn't what my question is about. I'm asking, why is it always slower?
Suppose the memory starts at 0x10000000. Why is it harder for the processor to get a 2-byte short from 0x10000001 than it is from 0x10000002? Why is it harder to get a 4-byte int from 0x10000001 than it is from 0x10000000? And so forth.
Because the data bus is wider than eight bits.
Let's assume that the data bus is 32 bits. To get 16 bits from address 0x10000001, it has to get the four bytes that start at 0x10000000 and shift the value to get the two bytes in the middle.
To get 16 bits from the address 0x10000003, it has to get the words that start at 0x10000000 and 0x10000004, and use one byte from each value.
The processor can only access memory in an aligned fashion. This is a consequence of how the interconnect between the processor and memory functions.
When a processor supports unaligned reads, what's really happening is that the processor issues two separate reads (or one read of a larger size) and stitches the parts together, which is why it's slower than an aligned read.
One example: if the data bus is 32 bits and a 32-bit value is not on a 32-bit boundary, the bytes will have to be fetched in more than one operation and moved around to load the value properly into a processor register.
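
If it helps to see it concretely, here is a small Python sketch (just address arithmetic under the assumption of a 32-bit data bus, not a timing model) that lists which aligned bus words a given read has to touch:

# Which aligned 4-byte bus words does a read touch?
BUS_WIDTH = 4   # assume a 32-bit data bus

def words_touched(addr, size):
    first = addr // BUS_WIDTH
    last  = (addr + size - 1) // BUS_WIDTH
    return [w * BUS_WIDTH for w in range(first, last + 1)]

# 2-byte read at 0x10000001: one bus word, but the bytes must be shifted out.
print([hex(w) for w in words_touched(0x10000001, 2)])   # ['0x10000000']

# 2-byte read at 0x10000003: straddles two bus words.
print([hex(w) for w in words_touched(0x10000003, 2)])   # ['0x10000000', '0x10000004']

# 4-byte read at 0x10000001: also straddles two bus words.
print([hex(w) for w in words_touched(0x10000001, 4)])   # ['0x10000000', '0x10000004']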

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory in an efficient manner: fewer cache misses/flushes, fewer bus transactions, etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such a pitfall: suppose a data structure is mis-aligned and requires the CPU/memory controller to perform 2 bus transactions (instead of 1) in order to fetch it; the performance is consequently lower.
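
To see that padding concretely, here is a short sketch using Python's ctypes, which lays structures out according to the local C ABI (so the exact sizes and offsets are platform-dependent):

# Inspect the padding a C compiler would insert, via ctypes.
import ctypes

class Padded(ctypes.Structure):
    _fields_ = [("a", ctypes.c_char),    # 1 byte
                ("b", ctypes.c_int),     # 4 bytes, wants 4-byte alignment
                ("c", ctypes.c_char)]    # 1 byte

class Packed(ctypes.Structure):
    _pack_ = 1                           # no padding at all
    _fields_ = [("a", ctypes.c_char),
                ("b", ctypes.c_int),
                ("c", ctypes.c_char)]

print(ctypes.sizeof(Padded), Padded.b.offset)   # typically 12 and 4
print(ctypes.sizeof(Packed), Packed.b.offset)   # 6 and 1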
The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware; it's 8 or some other value for some types of hardware, but let's stick with 4 to keep it simple).
All is well if the data begins at an address which is divisible by 4: the CPU goes to the memory address and loads the data.
Now suppose the data begins at an address that is not divisible by 4, say, for the sake of simplicity, at address 1. The CPU must fetch the data starting from address 0 and then apply some algorithm to discard the byte at address 0 in order to get at the actual data at byte 1. This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned.
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for an L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2 into L1 before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32-bit data fields together in a struct so that padding overhead is reduced (a small sketch of this follows below), or alternatively to review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
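
As a small, platform-dependent illustration of the field-ordering advice above (again using Python's ctypes, which follows the local C ABI), grouping the small fields together shrinks the structure:

# Grouping the sub-32-bit fields together reduces padding (platform-dependent).
import ctypes

class Scattered(ctypes.Structure):
    _fields_ = [("a", ctypes.c_uint8),
                ("x", ctypes.c_uint32),
                ("b", ctypes.c_uint8),
                ("y", ctypes.c_uint32),
                ("c", ctypes.c_uint8)]

class Grouped(ctypes.Structure):
    _fields_ = [("a", ctypes.c_uint8),
                ("b", ctypes.c_uint8),
                ("c", ctypes.c_uint8),
                ("x", ctypes.c_uint32),
                ("y", ctypes.c_uint32)]

print(ctypes.sizeof(Scattered))   # typically 20
print(ctypes.sizeof(Grouped))     # typically 12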
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!
