[image of textbook problem]
From my understanding, each 32-byte block should hold four contiguous algae_position structures, since an algae_position is defined as a structure with two ints, each four bytes, for a total of 8 bytes, and 32/8 = 4. As a result, I thought the pattern would be miss, hit, hit, hit, leading to a miss rate of 25% and 512 total misses. Thank you for shedding any light on this.
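For reference, here is a minimal sketch of the layout I'm describing; the struct matches the problem's definition, but the array size and the simple sequential loop are placeholders, since the actual code is only in the problem image above:

    /* Two 4-byte ints, so sizeof(struct algae_position) == 8 and a
     * 32-byte cache block holds exactly 4 contiguous elements (32 / 8 = 4). */
    struct algae_position {
        int x;
        int y;
    };

    #define N 1024  /* placeholder size; the real dimensions are in the problem */

    static struct algae_position grid[N];

    long walk_grid(void)
    {
        long sum = 0;
        /* Walking the array in order: the element that starts a new 32-byte
         * block misses, and the next three elements in that block hit, which
         * is the miss, hit, hit, hit pattern (25% miss rate) described above. */
        for (int i = 0; i < N; i++) {
            sum += grid[i].x;
        }
        return sum;
    }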
Related
Consider a 2KB direct-mapped cache with blocks of size 1 word. As always, addresses are 32 bits.
How many blocks does the cache contain? 2^7
How many bits long is each tag? (Tags are shown in pink in the class notes.) 23
How many bits long is each cache index? (These are green in the notes.) 7
What is the total size of the cache? (32 + 1 + 23) x 2^7
What percentage of the total size is the overhead?
What is overhead, and what is the percentage of overhead?
Overhead is the tag size, plus any other bits the cache needs to store other than the data itself.
(e.g. for an associative cache with LRU replacement, it would need to store some bits recording the LRU state, to track which member of the set is next in line for eviction.)
Overhead percentage is simply overhead / total size, as the assignment says (not overhead / data).
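Plugging in the numbers above, and assuming the only non-data bits per block are the 1 valid bit and the 23 tag bits: overhead per block = 1 + 23 = 24 bits, total per block = 32 + 1 + 23 = 56 bits, so the overhead percentage is 24 / 56 ≈ 43%. The factor of 2^7 blocks cancels out, since every block has the same layout.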
This is the basic understanding that I have so far for caches:
For direct-mapped caches, there is exactly one "line" or one "block" per set.
For a fully associative cache, there is exactly 1 set which contains all the blocks or lines.
For an n-way associative cache, there are exactly n lines or n blocks per set.
My question is, how do I go about calculating the number of sets given the total cache size and the block size? Is it just the cache size / block size? Is there a general formula that can be used for all types of caches to calculate this?
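For what it's worth, here is a sketch of the general relationship as I understand it (my own summary, not from the course notes): the number of sets is the cache size divided by (block size x associativity), where the associativity is 1 for a direct-mapped cache and equals the total number of blocks for a fully associative one.

    #include <stdio.h>

    /* General relationship (my own summary, not from the thread):
     *   blocks = cache_size / block_size
     *   sets   = blocks / ways
     * ways = 1 for direct-mapped; ways = blocks for fully associative. */
    static unsigned num_sets(unsigned cache_size, unsigned block_size, unsigned ways)
    {
        return cache_size / (block_size * ways);
    }

    int main(void)
    {
        /* 32 KB cache with 64-byte blocks (example numbers, not from the question) */
        printf("direct-mapped: %u sets\n", num_sets(32768, 64, 1));   /* 512 */
        printf("4-way:         %u sets\n", num_sets(32768, 64, 4));   /* 128 */
        printf("fully assoc:   %u sets\n", num_sets(32768, 64, 512)); /* 1   */
        return 0;
    }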
I have decided to reinvent the wheel for the millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say a GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (it could be smaller, or larger), and divide this area into 128-byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead of between 8 and 16 bytes, I guess. It might be more; I've read posts talking about 60 bytes.
The question is, do I need to make sure that none of my 128-byte blocks crosses a page boundary?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many chunks of, say, 4000 bytes (to take HeapAlloc() overhead into account), sub-divide each 4000 bytes into 128-byte blocks (4000 / 128 = 31 blocks of 128 bytes each), and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?
Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows pooling function ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is
allocated. Memory allocations of PAGE_SIZE or less do not cross page
boundaries. Memory allocations of less than PAGE_SIZE are not
necessarily page-aligned but are aligned to 8-byte boundaries in
32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the opinion that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place; that is, have enough data in the same contiguous region of memory that you can take full advantage of cache locality.
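As an illustration only (my own sketch, not code from the question): carving a VirtualAlloc'd region into fixed 128-byte blocks could look roughly like this. The 1 MB region and 128-byte block size are the question's numbers; everything else is an assumption.

    #include <windows.h>
    #include <stddef.h>

    #define POOL_BYTES  (1024 * 1024)  /* 1 MB region, as in the question     */
    #define BLOCK_BYTES 128            /* 128-byte blocks, as in the question */

    static char  *g_pool;       /* base of the committed region          */
    static size_t g_next_block; /* simple bump index; no free list here  */

    /* Commit the whole region up front. VirtualAlloc returns page-aligned
     * memory, so every 128-byte block is naturally 128-byte aligned and no
     * block straddles a 4096-byte page boundary (4096 is a multiple of 128). */
    static int pool_init(void)
    {
        g_pool = (char *)VirtualAlloc(NULL, POOL_BYTES,
                                      MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        g_next_block = 0;
        return g_pool != NULL;
    }

    /* Hand out the next 128-byte block, or NULL when the region is exhausted. */
    static void *pool_alloc_block(void)
    {
        if (g_pool == NULL || g_next_block >= POOL_BYTES / BLOCK_BYTES)
            return NULL;
        return g_pool + (g_next_block++) * BLOCK_BYTES;
    }

Since VirtualAlloc hands back whole pages, there is no per-allocation header inside the region, which sidesteps the HeapAlloc overhead question entirely.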
I've found out that if I use _aligned_malloc(), I don't need to worry about whether spreading a sub-block across two pages makes any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.
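For anyone finding this later, a minimal sketch of the _aligned_malloc() approach (the sizes here are placeholders, not my actual pool):

    #include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC CRT) */
    #include <stdio.h>

    int main(void)
    {
        /* Ask for 1 MB aligned to 128 bytes, so every 128-byte sub-block
         * carved out of it starts on a 128-byte boundary. */
        void *pool = _aligned_malloc(1024 * 1024, 128);
        if (pool == NULL)
            return 1;

        printf("pool at %p\n", pool);
        _aligned_free(pool);
        return 0;
    }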
I've tried every kind of reasoning I can possibly come up with, but I don't really understand this plot.
It basically shows the performance of reading and writing arrays of different sizes with different strides.
I understand that for a small stride like 4 bytes I read every cell in the cache, so I have good performance. But what happens when I have the 2 MB array and a 4k stride? Or the 4M array and a 4k stride? Why is the performance so bad? Finally, why, when I have a 1MB array, is performance decent when the stride is 1/8 of the size, worse when it is 1/4 of the size, and then very good at half the size?
Please help me, this thing is driving me mad.
The code is at this link: https://dl.dropboxusercontent.com/u/18373264/membench/membench.c
Your code loops for a given time interval instead of a constant number of accesses, so you're not comparing the same amount of work, and not all cache sizes/strides enjoy the same number of repetitions (so they get different chances at caching).
Also note that the second loop (the internal for) will probably get optimized away, since you don't use temp anywhere.
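As a rough sketch of what I mean (my own illustration, not the membench code): do a fixed number of accesses for every size/stride combination and feed the result into something the compiler cannot remove, along these lines:

    #include <stdio.h>
    #include <stddef.h>
    #include <time.h>

    #define ACCESSES (1 << 24)   /* same amount of work for every size/stride */

    static volatile int g_sink;  /* sink so the reads can't be optimized away */

    /* Assumes stride_bytes <= size_bytes and both are multiples of sizeof(int). */
    static double measure(int *array, size_t size_bytes, size_t stride_bytes)
    {
        size_t len   = size_bytes / sizeof(int);
        size_t step  = stride_bytes / sizeof(int);
        size_t index = 0;
        int temp = 0;

        clock_t start = clock();
        for (long i = 0; i < ACCESSES; i++) {
            temp += array[index];
            index += step;
            if (index >= len)
                index -= len;           /* wrap around the array */
        }
        clock_t end = clock();

        g_sink = temp;                  /* actually use the result */
        return (double)(end - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        static int array[1 << 20];      /* 4 MB of ints */
        printf("%f s\n", measure(array, sizeof(array), 4096));
        return 0;
    }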
EDIT:
Another effect in place here is TLB utilization:
On a 4k page system, as you grow your strides while they're still <4k, you'll enjoy less and less utilization of each page (finally reaching one access per page on the 4k stride), meaning growing access times as you'll have to access the 2nd level TLB on each access (possibly even serializing your accesses, at least partially).
Since you normalize your iteration count by the stride size, you'll have in general (size / stride) accesses in your innermost loop, multiplied by stride repetitions outside. However, the number of unique pages you access differs: for a 2M array and 2k stride, you'll have 1024 accesses in the inner loop, but only 512 unique pages, so 512*2k accesses to the TLB L2. On the 4k stride, there would still be 512 unique pages, but 512*4k TLB L2 accesses.
For the 1M array case, you'll have 256 unique pages overall, so the 2k stride would have 256*2k TLB L2 accesses, and the 4k stride would again have twice that.
This explains both why there's a gradual perf drop on each line as you approach 4k, and why each doubling in array size doubles the time for the same stride. The lower array sizes may still partially enjoy the L1 TLB, so you don't see the same effect there (although I'm not sure why 512k is there).
Now, once you start growing the stride above 4k, you suddenly start benefiting again, since you're actually skipping whole pages. An 8k stride would access only every other page, taking half the overall TLB accesses of the 4k stride for the same array size, and so on.
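One way to summarize the page-count arithmetic above (my own shorthand, assuming 4k pages and a stride no larger than the array): unique pages touched ≈ array size / max(stride, 4k). For the 2M array, the 2k and 4k strides both touch 2M / 4k = 512 pages, while an 8k stride touches only 2M / 8k = 256, which is where the recovery above 4k comes from.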
Sorry in advance if I have some of this wrong. I may edit to correct later if it's not too disruptive.
When multiple variables are declared in adjacent memory, as I understand it, on a very low level, registers are created that encapsulate a number of bytes, commonly 1, 2, 4 or 8. This allows those bit ranges to be binary rotated, as well as treated by the processor as numbers and so mutated with simple mathematics such as add, subtract, multiply and divide.
There may be abstraction reasons for not overlapping these ranges, but as many languages consider instructions to occur in a well-defined sequential order that the coder will be aware of, are there any performance reasons not to overlap one or more of them in adjacent bytes of allocated memory?
For example, in a block of allocated memory where every bit starts as 0, bytes 0 to 3 could be used as an int, as could bytes 1 to 4. The first could be set to a value before the second range is multiplied by 3.
If there are performance reasons not to, are they outweighed by otherwise having to copy values into and out of completely new variables and perform more complicated processes to achieve certain algorithms that could otherwise be done on a very low level?
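To make the example concrete, here is a sketch of that kind of overlap in C (using memcpy to stay clear of alignment and aliasing rules; the values are arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        unsigned char mem[8] = {0};   /* every bit starts as 0 */
        int32_t a, b;

        /* Treat bytes 0..3 as one int and set it to a value. */
        a = 0x01020304;
        memcpy(&mem[0], &a, sizeof a);

        /* Treat the overlapping bytes 1..4 as another int and multiply it by 3.
         * This also rewrites three of the bytes belonging to the first range. */
        memcpy(&b, &mem[1], sizeof b);
        b *= 3;
        memcpy(&mem[1], &b, sizeof b);

        /* The value read back from bytes 0..3 now reflects the overlap. */
        memcpy(&a, &mem[0], sizeof a);
        printf("bytes 0-3 now read as 0x%08x\n", (unsigned)a);
        return 0;
    }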
There is nothing wrong with this trick when it is done in assembly: optimizers have routinely made use of knowing where the parts of an integer are in order to save CPU cycles and reduce the size of the code. For example, when a 32-bit integer variable is initialized to a value that fits in 16 bits, optimizing compilers will replace the instruction that stores a 32-bit value in memory with a faster instruction that stores a 16-bit value to the lower bits of the variable, and clear the upper 16 bits. Moreover, many optimizers go even further: if a constant is divisible by 2^16, they store the value divided by 2^16 to the upper 16 bits, and clear the lower 16 bits.
Some architectures restrict such manipulations to addresses with certain properties, for example by requiring all 4-byte memory load/store instructions to be done at addresses divisible by four. These restrictions may reduce the applicability of partial-value writing tricks.
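For illustration, the "knowing where parts of an integer are" idea looks roughly like this in C (assuming a little-endian machine, where the low 16 bits sit at the lower address):

    #include <stdio.h>
    #include <stdint.h>

    /* The two 16-bit halves of a 32-bit value, laid out explicitly. On a
     * little-endian machine 'lo' occupies the lower-addressed bytes, which is
     * the half a compiler can write on its own with a narrower store. */
    union split32 {
        uint32_t whole;
        struct {
            uint16_t lo;
            uint16_t hi;
        } half;
    };

    int main(void)
    {
        union split32 v;

        /* "Store a 16-bit constant to the low half and clear the high half"
         * instead of a full 32-bit store of a value that fits in 16 bits. */
        v.half.lo = 1234;
        v.half.hi = 0;
        printf("%u\n", (unsigned)v.whole);   /* 1234 on little-endian systems */

        /* Storing a constant divisible by 2^16 via the high half only. */
        v.half.lo = 0;
        v.half.hi = 7;                       /* whole == 7 * 65536 == 458752 */
        printf("%u\n", (unsigned)v.whole);
        return 0;
    }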