What is overhead percentage? - caching

Consider a 2KB direct mapped cache with blocks of size 1 word. As
always, addresses are 32 bits.
How many blocks does the cache contain? 2^7
How many bits long is each tag? (Tags are shown in pink in the class
notes.) 2^23
How many bits long is each cache index? (These are green in the notes)
2^7
What is the total size of the cache? (32 + 1+ 23) x 2^7
What percentage of the total size is the overhead?
what is .. overhead .. and percentage of overhead.. ?

overhead is tag size, and any other bits the cache needs to store other than the data itself.
(e.g. for an associative cache with LRU replacement, it would need to store some bits that record the LRU state to track which member of the set is next in line for eviction.)
overhead percentage is obviously overhead / total size, as the assignment says. (not overhead / data).

Related

How to calculate cache block size from its overhead?

I've being looking for this for a lot of time now (over 3 days) without luck. Maybe one of you guys can tell me how can I solve it.
Consider you have a computer with a 16-bit size address and a byte addressable memory. The cache is 2-way set-associative mapped, write-back policy and a perfect LRU replacement strategy. Cache has an overhead of 4352 bits. What's the size of the block?
Very few resources talk about overhead and the ones I've found only relate it to total cache size. The problem is I only know how to calculate cache size with #blocks or at least with the fields of the address properly defined (which I have not being able to do for this problem since I can't calculate the size of the tag.).
Any help would be appreciated.
So, here's how I read this question:
Overhead bits are the bits that don't count toward the actual data that is being cached.  They are bits that track maintenance state of the cache, and help the cache implement hits, write back, and eviction policy.  To some way of looking at it, if one byte is being cached (8 bits) how many non-data bits are in the cache to help manage that (or at least for all the actual data bits how many non-data/overhead bits are there).
This is mathematical, so I hope I haven't made an error, but even if I have maybe you can see your way through the reasoning.
Let's derive some additional information:
A write-back policy means the cache needs to store "dirty" information for each data block: dirty is 1-bit: yes, dirty -or- no, clean.
For 2-way set associative cache, a "perfect" LRU algorithm is also 1 bit (yes: first block -or- no: second block) but this 1 bit costs per index position (i.e. per line) — not per block as there are two blocks per index.
What we don't know is if there is a valid bit, which would also be per data block, but most caches I see in coursework have the valid bits, so we might assume they have it.
And lastly, there's the tag bits where tag bits are: however many bits are leftover in the address space bits after accounting for index bits and block offset bits.
So, a formula for overhead might be:
overhead in bits = index positions * (1 x LRU bit + block overhead bits)
where block overhead bits = 2 [ways] * (1 x Dirty bit + 1 x Valid bit + tag bits)
We also know that tag bits = address space bits - index bits - block bits
So, we have:
4352 [overhead in bits] = index positions * (1 + 2 * (2 + tag bits))
-and-
tag bits = address space bits - index bits - block offset bits
-and-
index positions = 2index bits
-and-
We also know that the number of tag, index, and block offset bits has to be an integer (no fractions of bits).
So, we can begin to reduce those two formulas by substituting:
4352 = index positions * (1 + 2 * (2 + address space bits - index bits - block bits)
by reduction also then:
4352 = 2index bits * (1 + 2 * (2 + 16 - index bits - block bits)
Solving for block bits we have:
-((4352/2index bits - 1)/2 - 18 + index bits) = block bits
I don't know how to solve this directly mathematically, given the constraint that the variables must be integers, so, instead of solving directly, simply try/search different values:
If index bits is 7 then by this formula, block bits is fractional, so that doesn't work.
If index bits is 9 then by this formula, block bits is fractional, so that doesn't work.
No other values between 0 and 16 result in an integer number of bits, except:
If index bits is 8 then by this formula, block bits is 2, so:
16 = tag bits + 8 + 2, meaning tag bits is 6, index bits is 8, and block offset is 2.
Since block offset is 2 then block size is 22.

Set Associative Cache - How many memory blocks map to a set?

I'm studying for a final and I have the following example problem with professor answer:
Question
An instruction cache (I-cache) has a storage capacity of 524288 bytes
and employs 128-byte cache lines. The memory with which the cache is
used is 209715200 bytes in size. How many different memory blocks
could potentially map to set 10 within the 4-way set associative
cache?
Answer
Any memory block whose block number has bits 16 through 7 equal to
0000001010 would map to set 10. Since the memory is 209715200 bytes in
size (=0xC800000), so only the low 28 bits of the address will ever be
none-zero. That is, at most 28 bits are needed to specify any address
within the memory. So the upper 28-10-7 = 11 bits would determine how
many different blocks map to an individual set. The value in the upper
11 bits range from 0 to 11000111111 = decimal1599. So 1600 different
blocks map to set 10 within the 4-way set associative cache.
Can anyone explain this answer? I am particularly confused about the following items. I thought the upper 11 bits were the tag. Why would this have to do with whether it maps to set 10 or not. I also don't understand where the upper bound of 11000111111 for the 11 bits came from?
You are correct. The tag is the upper 11 bits. But what the poorly worded explanation is getting at is that the tag corresponding to the physical memory will never be mapped to greater than 11000111111, because the physical memory with a size of 0xC800000 has an upper limit of that number. In other words, a tag greater than 11000111111 will map to a physical address of 0xC800001 or greater.
Essentially to find the number of blocks that a set can be mapped to, just count the number of tag possibilities that there can be given the size of the physical memory and the size of the set bits and offset bits.

Direct Mapped Cache Bits

So Im having trouble understanding some parts to direct mapped caching. I have a byte addressed memory system that has 64KB memory with a 2KB direct-mapped cache. Cache blocks are 32 bytes.
From what I understand and please correct me if i'm wrong, I have 2048B/32B = 64 cache blocks. I need to figure out how many total bits are needed for each cache entry (tag, "dirty" bit, etc).
I believe i'll need 6 index bits (2^6 = 64 (# of blocks))
and 5 offset bits (2^5 = 32 (size of cache block))
Im just having trouble figuring out the rest that are needed.
The bits of a physical address can be split into 3 groups - the least significant group of bits that determine "offset of byte within cache block" and doesn't need to be stored in the tag, the middle group of bits that determine "index of cache block within the cache" and doesn't need to be stored in the tag, and the most significant group of bits that is used to check if the data in the cache is the data you want which must be stored in the tag.
With 64 KiB of physical address space a physical address would have 16 bits; and if your cache is 2048 bytes then (for "direct mapped") the least significant group of bits and the middle group of bits combined must add up to a total of 11 bits. That means the most significant group of bits (which must be stored in the tag) needs to be 5 bits (because 16 bits - 11 bits = 5 bits).
For other bits; you always need something to indicate if the entry is used or empty; if the cache is "write-back" you need a dirty bit but if the cache is "write-through" you don't; if there are multiple CPUs and cache coherency you need more bits for that (e.g. exclusive/shared); and if there's any kind of error detection or correction you need more bits for that (e.g. a "parity bit"). This means the total tag size is at least 6 bits (but may be more).

Calculating number of bits in a cache

Preface: There are many different design patterns that are important to cache's overall performance. Below are listed parameters for
different direct-mapped cache designs.
Cache data size: 32 kib
Cache block Size: 2 words
Cache access time: 1-cycle
Question: Calculate the number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the
total size of the closest direct-mapped cache with 16-word blocks of
equal size or greater. Explain why the second cache, despite its
larger data size, might provide slower performance that the first
cache.
Here's the formula:
Number of bits in a cache 2^n X (block size + tag size + valid field size)
Here's what I got: 65536(1+14X(32X2)..
is this correct?
using: (2^index bits) * (valid bits + tag bits + (data bits * 2^offset bits))
for the first one i get:
total bits = 2^15 (1+14+(32*2^1)) = 2588672 bits
for the cache with 16 word blocks i get:
total bits = 2^13(1 +13+(32*2^4)) = 4308992
the next smallest cache with 16 word blocks and a 32 bit address works out to be 2158592 bits, smaller than the first cache.
I'm stuck on the same problem too but I have the answer to the first part.
To calculate the total number of bits required
You need to convert the KB to words and get the index bits.
Use the answer from part 1 to get your tag bits.
Plug them into this formula.
(2^(index bits)) * ((tag bits)+(valid bits)+(data size))
Hint: data size is 64 bits in this case and valid bit is 1. So just find the index and tag bits.
And I don't think your answer is right. I didn't check but I can see you are multiplying 1+14 and (32x2) instead of adding them.
I think the formula you were using is correct. According to my textbook "Computer Organization and Design The Hardware, 5th edition", the total number of bits in a direct-mapped cache is:
2^indext bits * (block size + tag size + valid field size).
block size was given by the question: 2 words = 32 bits
tag size: 32 - offset in bits - index in bits
valid field size is usually 1 valid bit

Memory benchmark plot: understanding cache behaviour

I've tried every kind of reasoning I can possibly came out with but I don't really understand this plot.
It basically shows the performance of reading and writing from different size array with different stride.
I understand that for small stride like 4 bytes I read all the cell in the cache, consequently I have good performance. But what happen when I have the 2 MB array and the 4k stride? or the 4M and 4k stride? Why the performance are so bad? Finally why when I have 1MB array and the stride is 1/8 of the size performance are decent, when is 1/4 the size performance get worst and then at half the size, performance are super good?
Please help me, this thing is driving me mad.
At this link, the code: https://dl.dropboxusercontent.com/u/18373264/membench/membench.c
Your code loops for a given time interval instead of constant number of access, you're not comparing the same amount of work, and not all cache sizes/strides enjoy the same number of repetitions (so they get different chance for caching).
Also note that the second loop will probably get optimized away (the internal for) since you don't use temp anywhere.
EDIT:
Another effect in place here is TLB utilization:
On a 4k page system, as you grow your strides while they're still <4k, you'll enjoy less and less utilization of each page (finally reaching one access per page on the 4k stride), meaning growing access times as you'll have to access the 2nd level TLB on each access (possibly even serializing your accesses, at least partially).
Since you normalize your iteration count by the stride size, you'll have in general (size / stride) accesses in your innermost loop, but * stride outside. However, the number of unique pages you access differs - for 2M array, 2k stride, you'll have 1024 accesses in the inner loop, but only 512 unique pages, so 512*2k accesses to TLB L2. on the 4k stride, there would be 512 unique pages still, but 512*4k TLB L2 accesses.
For the 1M array case, you'll have 256 unique pages overall, so the 2k stride would have 256 * 2k TLB L2 accesses, and the 4k would again have twice.
This explains both why there's gradual perf drop on each line as you approach 4k, as well as why each doubling in array size doubles the time for the same stride. The lower array sizes may still partially enjoy the L1 TLB so you don't see the same effect (although i'm not sure why 512k is there).
Now, once you start growing the stride above 4k, you suddenly start benefiting again since you're actually skipping whole pages. 8K stride would access only every other page, taking half the overall TLB accesses as 4k for the same array size, and so on.

Resources