LRU vs FIFO vs Random - algorithm

When there is a page fault or a cache miss we can use either the Least Recently Used (LRU), First In First Out (FIFO) or Random replacement algorithms. I was wondering, which one provides the best performance, i.e. the fewest possible future cache misses/page faults?
Architecture: Coldfire processor

The expression "There are no stupid questions" fits this so well. This was such a good question I had to create an account and post on it and share my views as somebody who has modelled caches on a couple of CPUs.
You specify the architecture, a 68000-family part, which is a CPU rather than a GPU, USB controller or other piece of hardware which may access a cache, however...
Therefore the code you run on the 68000 will make a huge difference to the part of the question about the "fewest possible future cache misses/page faults".
Here you differentiate between cache misses and page faults. I'm not sure exactly which ColdFire architecture you are referring to, but I'm assuming it doesn't have hardware TLB replacement and uses a software mechanism (so the cache will be shared with the data of applications).
For the replacement policy the most important factor is the number of associations (or ways).
A direct-mapped cache (1 way) correlates (almost always) directly with the low-order bits of the address (the number of bits specifies the size of the cache), so a 32k cache uses the lower 15 bits. In this case the replacement algorithms LRU, FIFO or Random are useless, as there is only one possible choice.
However the write-back or write-through selection of the cache has more effect. For writes to memory only, write-through means the cache line isn't allocated, as opposed to a write-back cache where the line currently in the cache that shares the same lower 15 bits is ejected from the cache, read back in and then modified, for use IF the code running on the CPU uses this data.
For operations that write and don't perform multiple operations on the data, write-through is usually much better. Also, on modern processors (and I don't know if this architecture supports it), write-through or write-back can be selected on a TLB/page basis. This can have a much greater effect on the cache than the policy; you can program the system to match the type of data in each page, especially in a direct-mapped cache ;-)
So a direct-mapped cache is pretty easy to understand, and it's also easy to use it to understand a cache's worst case, best case and average case.
Imagine a memcpy routine which copies data that is aligned to the cache size. For example, a 32k direct-mapped cache with two 32k buffers aligned on a 32k boundary...
0x0000 -> read
0x8000 -> write
0x0004 -> read
0x8004 -> write
...
0x7ffc -> read
0xfffc -> write
Here you see the reads and writes as they copy each word of data; notice the lower 15 bits are the same for each read and write pair.
A direct-mapped cache using write-back (remember, write-back allocates lines) does the following:
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8000 (modify this location in the cache with the read source data)
<loop>
0x0004 -> read
cache performs: (miss)
writeback 0x8000:0x801f -> WRITE to main memory (ie. write 32 bytes to the destination)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source, the same as we did just before)
0x8004 -> write
cache performs: (miss)
invalidate 0x0000:0x001f (line 0)
0x8000:0x801f -> READ from main memory (ie. read 32 bytes of the destination)
0x8004 (modify this location in the cache with the read source data)
</loop> <--- (side note XML is not a language but we use it as such)
As you can see, a lot of memory operations go on; this is in fact called "thrashing" and is the best example of a worst-case scenario.
Now imagine that we use a write-through cache; these are the operations:
<loop>
0x0000 -> read
cache performs: (miss)
0x0000:0x001f -> READ from main memory (ie. read 32 bytes of the source)
0x8000 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory; "posted" is like a letter: you just place it in the mailbox and you don't care if it takes a week to get there).
<loop>
0x0004 -> read
cache performs: (hit)
(not a lot, it just pulls the data it fetched last time, which it has in its memory, so it goes very quickly to the CPU)
0x8004 -> write
cache performs: (not a miss)
(not a lot, the write is "posted" to main memory)
</loop until next 32 bytes>
</loop until end of buffer>
As you can see, there's a huge difference: we no longer thrash, and in fact this example is the best case.
Ok so that's the simple case of write through vs. write back.
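To make the difference measurable, here is a toy model of the copy above (my own sketch, not ColdFire-specific behaviour): a 32k direct-mapped cache with 32-byte lines, copying a 32k buffer at 0x0000 to one at 0x8000 one 32-bit word at a time, counting line transfers to and from main memory for write-back (write-allocate) versus write-through (no-allocate).

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define LINES      1024            /* 32 KiB of cache / 32-byte lines */
#define LINE_SHIFT 5

struct line { int valid, dirty; uint32_t tag; };

static struct line cache[LINES];
static long line_reads, line_writebacks, posted_word_writes;

static struct line *lookup(uint32_t addr, int *hit)
{
    struct line *l = &cache[(addr >> LINE_SHIFT) % LINES];
    *hit = l->valid && l->tag == (addr >> LINE_SHIFT) / LINES;
    return l;
}

static void fill(struct line *l, uint32_t addr, int writeback)
{
    if (writeback && l->valid && l->dirty)
        line_writebacks++;                  /* evict a dirty line first */
    line_reads++;                           /* line fill from main memory */
    l->valid = 1; l->dirty = 0;
    l->tag = (addr >> LINE_SHIFT) / LINES;
}

static void simulate(int writeback)
{
    memset(cache, 0, sizeof cache);
    line_reads = line_writebacks = posted_word_writes = 0;

    for (uint32_t off = 0; off < 0x8000; off += 4) {
        int hit;
        struct line *l = lookup(0x0000 + off, &hit);    /* read source word */
        if (!hit) fill(l, 0x0000 + off, writeback);

        l = lookup(0x8000 + off, &hit);                 /* write destination word */
        if (writeback) {
            if (!hit) fill(l, 0x8000 + off, 1);         /* write-allocate */
            l->dirty = 1;
        } else {
            posted_word_writes++;                       /* posted write, no allocate */
        }
    }
    printf("%-14s %6ld line reads, %5ld line writebacks, %5ld posted word writes\n",
           writeback ? "write-back:" : "write-through:",
           line_reads, line_writebacks, posted_word_writes);
}

int main(void) { simulate(1); simulate(0); return 0; }

The counts match the walk-through above: roughly 16 line reads and 8 writebacks per 32-byte chunk copied in the write-back case, versus one line read per chunk plus the posted word writes in the write-through case.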
Direct-mapped caches, however, are now not very common; most people use 2-, 4- or 8-way caches, that is, there are 2, 4 or 8 different possible places for each line. So we could store 0x0000, 0x8000, 0x1000, 0x1800 all in the cache at the same time in a 4-way or 8-way cache (obviously an 8-way could also store 0x2000, 0x2800, 0x3000, 0x3800 as well).
This avoids the thrashing issue.
Just to clarify, the line number in a 32k direct-mapped cache comes from the bottom 15 bits of the address.
In a 32k 2-way it's the bottom 14 bits.
In a 32k 4-way it's the bottom 13 bits.
In a 32k 8-way it's the bottom 12 bits.
And in a fully associative cache it's just the line size (or the bottom 5 bits with a 32-byte line). You can't have less than a line. 32 bytes is typically the optimum transfer size in a DDR memory system (there are other reasons as well; sometimes 16 or 64 bytes can be better, and 1 byte would be optimal in the purely algorithmic case, but let's use 32 as that's very common).
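In code, that index arithmetic looks something like this (a sketch using the sizes assumed above, not taken from any particular CPU manual):

#include <stdint.h>

#define CACHE_SIZE 32768u   /* 32k */
#define LINE_SIZE  32u      /* 32-byte lines */

/* The number of sets halves each time the number of ways doubles,
 * so the index uses one fewer address bit. */
static unsigned set_index(uint32_t addr, unsigned ways)
{
    unsigned sets = CACHE_SIZE / (LINE_SIZE * ways);    /* 1024, 512, 256, 128, ... */
    return (addr / LINE_SIZE) % sets;
}

static uint32_t tag_bits(uint32_t addr, unsigned ways)
{
    unsigned sets = CACHE_SIZE / (LINE_SIZE * ways);
    return addr / (LINE_SIZE * sets);                   /* everything above the index */
}

With ways = 1 the index comes from the bottom 15 bits of the address (above the 5-bit line offset), matching the direct-mapped case; with ways = 8 only the bottom 12 bits matter.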
To help understand LRU, FIFO and Random, consider a cache that is fully associative; in a 32k cache with 32-byte lines that is 1024 lines.
A random replacement policy would, on average, only evict the wrong line about once every 1024 replacements (ie. a ~99.9% hit rate), whereas with either LRU or FIFO I could always write a program which would "thrash", ie. always cause worst-case behaviour (ie. a 0% hit rate).
Clearly, if you had a fully associative cache, you would only choose LRU or FIFO if the program was known and its exact behaviour was known.
For ANYTHING which wasn't 99.9% predictable you would choose RANDOM; it's simply the best at not being worst, and one of the best at being average. But how about the best case (where I get the best performance)?
Well it depends basically on the number of ways...
With 2 ways I can optimise things like memcpy and other algorithms to do a good job. Random would get it wrong half the time.
With 4 ways, when I switch between tasks I might not damage the cache so much, and their data stays local. Random would get it wrong a quarter of the time.
With 8 ways statistics start to take effect: a 7/8 hit rate on a memcpy isn't as good as 1023/1024 (fully associative or optimised code), but for non-optimised code it makes a difference.
So why don't people make fully associative caches with random replacement policies?
Well, it's not because they can't generate good random numbers; in fact a pseudo-random number generator is just as good. And yes, I could write a program to get a 100% miss rate, but that isn't the point: I couldn't write a useful program that would have a 100% miss rate, which I could with an LRU or FIFO algorithm.
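As an aside, the kind of "good enough" pseudo-random source a cache controller might use is tiny; for example, a sketch of a replacement-way selector built on a 16-bit Galois LFSR (illustrative only, not how any particular cache does it):

#include <stdint.h>

static uint16_t lfsr = 0xACE1u;              /* any non-zero seed */

/* Returns a pseudo-random way; ways must be a power of two. */
static unsigned random_way(unsigned ways)
{
    unsigned bit = lfsr & 1u;
    lfsr >>= 1;
    if (bit)
        lfsr ^= 0xB400u;                     /* taps for a maximal-length 16-bit LFSR */
    return lfsr & (ways - 1);
}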
A 32k fully associative cache with 32-byte lines requires you to compare 1024 values; in hardware this is done through a CAM, but that is an expensive piece of hardware, and it's also just not possible to compare this many values in a "FAST" processing time. I wonder if a quantum computer could...
Anyway, to answer your question of which one is better:
Consider whether write-through might be better than write-back.
With a large number of ways, RANDOM is better.
For unknown code, RANDOM is better at 4 ways or above.
If it's a single function, or you want the most speed from something you're willing to optimise, and/or you don't care about the worst case, then LRU is probably what you want.
If you have very few ways, LRU is likely what you want, unless you have a very specific scenario in which case FIFO might be OK.
References:
http://www.freescale.com/files/training/doc/APF_ENT_T0801_5225_COLDFIRE_MCU.pdf
http://en.wikipedia.org/wiki/Cache_algorithms

No perfect caching policy exists because it would require knowledge of the future (how a program will access memory).
But, some are measurably better than others in the common memory access pattern cases. This is the case with LRU. LRU has historically given very good performance in overall use.
But, for what you are trying to do, another policy may be better. There is always some memory access pattern that will cause a caching policy to perform poorly.
You may find this thread helpful (and more elaborate!)
Why is LRU better than FIFO?

Many of the architectures I have studied use LRU, as it is generally not only efficient to implement but also pretty good on average at preventing misses. However, in the latest x86 architectures, I think there are some more complicated things going on. LRU is sort of a basic model.
It really depends on what kind of operations you are performing on your device. Depending on the types of operations, different eviction policies will work better. For example, FIFO works well with traversing memory sequentially.
Hope this helps, I'm not really an architecture guy.

Between the three, I'd recommend LRU. First, it's a good approximation to optimal scheduling when locality is assumed (this turns out to be a good assumption). Random scheduling cannot benefit from locality. Second, it doesn't suffer from Belady's anomaly (like FIFO); that is, bigger caches mean better performance, which isn't necessarily true with FIFO.
Unless your specific problem domain strongly suggests using something else, LRU is going to be hard to beat in the general case.

Of the three, LRU is generally the best while FIFO is the worst and random comes in somewhere between. You can construct access patterns where any of the three is superior to either of the others, but it is somewhat tricky. Interestingly enough, this order is also roughly how expensive they are to implement -- LRU is the most expensive and FIFO is the cheapest. Just goes to show, there's no free lunch.

If you want the best of both worlds, consider an adaptive approach that alters its strategy based on actual usage patterns. For example, look at IBM's Adaptive Replacement Cache algorithm: http://code.activestate.com/recipes/576532-adaptive-replacement-cache-in-python/

Related

How to view paging system (demand paging) as another layer of cache?

I tried solving the following question
Consider a machine with 128MiB (i.e. 2^27 bytes) of main memory and an MMU which has a page size of 8KiB (i.e. 2^13 bytes). The operating system provides demand paging to a large spinning disk.
Viewing this paging system as another layer of caching below the processor’s last-level cache (LLC), answer following questions regarding the characteristics of this “cache”:
Line size in bytes? 2^13 (every page has 2^13 bytes)
Associativity? Full Associative
Number of lines in cache? 2^14 (Memory size / page size)
Tag size in bits? 14 (Number of lines in cache is 2^14 which gives us 14 bits for tag)
Replacement policy? I am not sure at all (maybe clock algorithm which approximates LRU)
Writeback or write-through? write back (It is not consistent with Disk at all times)
Write-allocate? yes, because after page fault we bring the page to memory for both writing and reading
Exclusivity/Inclusivity? I think non-inclusive and non-exclusive (NINE), maybe because memory-mapped files are partially in memory and partially in a swap file or ELF file (program text). For example, the stack of a process is only in memory, except when we run out of memory and send it to a swap file. Am I right?
I would be glad if someone checked my answers and helped me solve this correctly. Thanks! Sorry if this is not the place to ask this kind of question.
To start; your answers for line size, associativity and number of lines are right.
Tag size in bits? 14 (Number of lines in cache is 2^14 which gives us 14 bits for tag)
Tag size would be "location on disk / line size", plus some other bits (e.g. for managing the replacement policy). We don't know how big the disk is (other than "large").
However; it's possibly not unreasonable to work this out backwards - start from the assumption that the OS was designed to support a wide range of different hard disk sizes and that the tag is a nice "power of 2"; then assume that 4 bits are used for other purposes (e.g. "age" for LRU). If the tag is 32 bits (a very common "power of 2") then it would imply the OS could support a maximum disk size of 2 TiB ("1 << (32-4) * 8 KiB"), which is (keeping "future proofing" in mind) a little too small for an OS designed in the last 10 years or so. The next larger "power of 2" is 64 bits, which is very likely for modern hardware, but less likely for older hardware (e.g. 32-bit CPUs, smaller disks). Based on "128 MiB of RAM" I'd suspect that the hardware is very old (e.g. normal desktop/server systems started having more than 128 MiB in the late 1990s), so I'd go with "32 bit tag".
Replacement policy? I am not sure at all (maybe clock algorithm which approximates LRU)
Writeback or write-through? write back (It is not consistent with Disk at all times)
Write-allocate? yes, because after page fault we bring the page to memory for both writing and reading
There isn't enough information to be sure.
A literal write-through policy would be a performance disaster (imagine having to write 8 KiB to a relatively slow disk every time anything pushed data on the stack). A write back policy would be very bad for fault tolerance (e.g. if there's a power failure you'd lose far too much data).
This alone is enough to imply that it's some kind of custom design (neither strictly write-back nor strictly write-through).
To complicate things more; an OS could take into account "eviction costs". E.g. if the data in memory is already on the disk then the page can be re-used immediately, but if the data in memory has been modified then that page would have to be saved to disk before the memory can be re-used; so if/when the OS needs to evict data from cache to make room (e.g. for more recently used data) it'd be reasonable to prefer evicting an unmodified page (which is cheaper to evict). In addition; for spinning disks, it's common for an OS to optimize disk access to minimize head movement (where the goal is to reduce seek times and improve disk IO performance).
The OS might combine all of these factors when deciding when modified data is written to disk.
Exclusivity/Inclusivity? I think non-inclusive and non-exclusive (NINE), maybe because memory-mapped files are partially in memory and partially in a swap file or ELF file (program text). For example, the stack of a process is only in memory, except when we run out of memory and send it to a swap file. Am I right?
If RAM is treated as a cache of disk, then the system is an example of single-level store (see https://en.wikipedia.org/wiki/Single-level_store ) and isn't a normal OS (with normal virtual memory - e.g. swap space and file systems). Typically systems that use single-level store are built around the idea of having "persistent objects" and do not have files at all. With this in mind; I don't think it's reasonable to make assumptions that would make sense for a normal operating system (e.g. assume that there are executable files, or that memory mapped files are supported, or that some part of the disk is "swap space" and another part of the disk is not).
However; I would assume that you're right about "non-inclusive and non exclusive (NINE)" - inclusive would be bad for performance (for the same reason write-through would be bad for performance) and exclusive would be very bad for fault tolerance (for the same reason that write-back is bad for fault tolerance).

Cache Performance in Hash Tables with Chaining vs Open Addressing

In this following website from geeksforgeeks.org it states that
Cache performance of chaining is not good as keys are stored using linked list. Open addressing provides better cache performance as everything is stored in same table.
So my questions are:
What causes chaining to have a bad cache performance?
Where is the cache being used?
Why would open addressing provide better cache performance as I cannot see how the cache comes into this?
Also, what considerations would you take into account when deciding between chaining, linear-probed open addressing and quadratic-probed open addressing?
Sorry, since these are quite broad questions, the answers will also be quite generic, with some links for more detailed information.
It is better to start with the question:
Where is the cache being used?
On modern CPUs, cache is used everywhere: to read program instructions and read/write data in memory. On most CPUs cache is transparent, i.e. there is no need to explicitly manage the cache.
Cache is much faster than the main memory (DRAM). Just to give you a perspective, accessing data in Level 1 cache is ~4 CPU cycles, while accessing the DRAM on the same CPU is ~200 CPU cycles, i.e. 50 times faster.
Caches operate on small blocks called cache lines, which are usually 64 bytes long.
More info: https://en.wikipedia.org/wiki/CPU_cache
What causes chaining to have a bad cache performance?
Basically, chaining is not cache friendly. It is not just an issue for hash tables; "classical" linked lists have the same problem.
Hash keys (or list nodes) are far away from each other, so each key access generates a "cache miss", i.e. slow DRAM access. So checking 10 keys in a chain takes 10 DRAM accesses, i.e. 200 x 10 = 2000 cycles for our generic CPU.
The address of the next key is not known until the next pointer is read from the current key, so there is not much room for optimization...
Why would open addressing provide better cache performance as I cannot see how the cache comes into this?
Linear probing is cache friendly. Keys are "clustered" together, so once we access the first key (a slow DRAM access), the next key will most probably already be in the cache, since the cache line is 64 bytes. So accessing the same 10 keys with open addressing takes 1 DRAM access and 9 cache accesses, i.e. 200 x 1 + 9 x 4 = 236 cycles for our generic CPU. That is much faster than the 2000 cycles for chained keys.
Also, since we access the memory in a predictable manner, there is room for optimizations like cache prefetching: https://en.wikipedia.org/wiki/Cache_prefetching
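A minimal sketch contrasting the two lookups (the table layout and names are my own, not from the article):

#include <stddef.h>
#include <stdbool.h>

/* Chaining: each probe follows a pointer, so each node is likely a fresh cache miss. */
struct chain_node { int key, value; struct chain_node *next; };

bool chain_lookup(const struct chain_node *n, int key, int *value_out)
{
    for (; n != NULL; n = n->next)
        if (n->key == key) { *value_out = n->value; return true; }
    return false;
}

/* Open addressing with linear probing: probes walk sequentially through one flat
 * array, so after the first (slow) access the next few slots are usually in the
 * cache line that was just fetched. */
struct slot { int key, value; bool used; };

bool probe_lookup(const struct slot *table, size_t size, size_t start,
                  int key, int *value_out)
{
    for (size_t i = 0; i < size; i++) {
        const struct slot *s = &table[(start + i) % size];
        if (!s->used) return false;              /* empty slot: key not present */
        if (s->key == key) { *value_out = s->value; return true; }
    }
    return false;
}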
Also, what considerations would you take into account when deciding between chaining, linear-probed open addressing and quadratic-probed open addressing?
Having to walk long chains or probe sequences is not a good sign anyway. So the first thing I would consider is making sure the probability of collisions is at a minimum, by using a good hash function and a reasonable hash size.
The second thing I would consider is a ready-to-use solution. Sure, there still might be some rare cases when you need your own implementation...
Not sure about the language, but here is a blazingly fast hash table implementation with a BSD license: http://dpdk.org/browse/dpdk/tree/lib/librte_hash/rte_cuckoo_hash.h
So, if you still need your own hash table implementation and you do care about performance, the next fairly easy thing to implement would be to use cache-aligned buckets instead of plain hash elements. It wastes a few bytes per element (i.e. each hash table element will be 64 bytes long), but in case of a collision there will be some fast storage for at least a few keys. The code to manage those buckets will also be a bit more complicated, so it is worth considering whether it pays off for you...
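A sketch of such a cache-aligned bucket (sizes and field names are illustrative, not the DPDK layout):

#include <stdalign.h>
#include <stdbool.h>

#define BUCKET_SLOTS 7

/* One bucket occupies exactly one 64-byte cache line, so colliding keys cost
 * extra work only inside a line that is already in the cache. */
struct cache_bucket {
    alignas(64) unsigned char count;            /* slots in use */
    int keys[BUCKET_SLOTS];
    int values[BUCKET_SLOTS];
};

static bool bucket_lookup(const struct cache_bucket *b, int key, int *value_out)
{
    for (unsigned i = 0; i < b->count; i++) {   /* all within one cache line */
        if (b->keys[i] == key) { *value_out = b->values[i]; return true; }
    }
    return false;
}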

What cache invalidation algorithms are used in actual CPU caches?

I came across the topic of caching and mapping and cache misses, and how cache blocks get replaced, and in what order, when all blocks are already full.
There is the least recently used algorithm, the FIFO algorithm, the least frequently used algorithm, random replacement, ...
But what algorithms are used in actual CPU caches? Or can you use all of them, and the operating system decides what the best algorithm is?
Edit: Even though I chose an answer, any further information is welcome ;)
As hivert said - it's hard to get a clear picture on the specific algorithm, but one can deduce some of the information according to hints or clever reverse engineering.
You didn't specify which CPU you mean; each one can have a different policy (actually even within the same CPU different cache levels may have different policies, not to mention TLBs and other associative arrays which also may have such policies). I did find a few hints about Intel (specifically Ivy Bridge), so we'll use this as a benchmark for industry-level "standards" (which may or may not apply elsewhere).
First, Intel presented some LRU related features here -
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf
Slide 46 mentioned "Quad-Age LRU" - this is apparently an age based LRU that assigned some "age" to each line according to its predicted importance. They mention that prefetches get middle age, so demands are probably allocated at a higher age (or lower, whatever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "fifo-like" LRU, but keep in mind that most caches don't implement that, but rather a complicated pseudo-LRU solution, so this might be an improvement.
Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).
I guess this could answer, to some extent, your question on multiple LRU schemes. Implementing several schemes is probably hard and expensive in terms of HW, but when you have a policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling, etc.
The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to set an LRU for that pair, then an LRU bit for each pair of pairs of ways, etc. The 8-way L2 used pseudo-random replacement settable by privileged software (the OS) as using either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator.
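For concreteness, here is a minimal sketch of binary tree pLRU for one 8-way set (my own illustration, not taken from the 7450 documentation):

#include <stdint.h>

#define WAYS 8

/* 7 tree bits for 8 ways: node i's children are 2*i+1 (left) and 2*i+2 (right);
 * a 0 bit means "the pseudo-LRU victim is in the left subtree", 1 means right. */
static uint8_t plru[WAYS - 1];

/* On a hit or a fill of `way`, point every node on the path away from it. */
void plru_touch(unsigned way)
{
    unsigned node = 0;
    for (unsigned bit = WAYS / 2; bit >= 1; bit /= 2) {
        unsigned right = (way & bit) != 0;   /* which child this way lives under */
        plru[node] = !right;                 /* mark the other child as older */
        node = 2 * node + 1 + right;
    }
}

/* On a miss, follow the bits to the pseudo-least-recently-used way. */
unsigned plru_victim(void)
{
    unsigned node = 0, way = 0;
    for (unsigned bit = WAYS / 2; bit >= 1; bit /= 2) {
        unsigned right = plru[node];
        if (right) way |= bit;
        node = 2 * node + 1 + right;
    }
    return way;
}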
The StrongARM SA-1110 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (The Intel StrongARM SA-1110 Microprocessor Developer's Manual states "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism."; but 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.)
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for L1 and L2 data caches, L2 instruction cache, and the L3 cache (L1 instruction cache is documented as using LRU.). For the 24-way L3 cache in the Itanium 2 6M (Madison), a bit per block was provided for NRU with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the one that set the last unset bit) and that the victim was chosen by a find first unset scan of the bits. This would have the hardware advantage of only having to read the information (which was stored in distinct arrays from but nearby the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases and in some of these cases even random replacement can increase the hit rate).
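A minimal sketch of that bit-per-block NRU scheme for one set (my own rendering of the behaviour described above, not vendor code):

#include <stdint.h>
#include <string.h>

#define WAYS 24                               /* e.g. the 24-way Itanium 2 L3 */

struct nru_set { uint8_t used[WAYS]; };       /* one "recently used" bit per block */

void nru_touch(struct nru_set *s, unsigned way)
{
    s->used[way] = 1;
    for (unsigned i = 0; i < WAYS; i++)
        if (!s->used[i])
            return;                           /* not all set yet */
    memset(s->used, 0, sizeof s->used);       /* all set: clear them all */
}

unsigned nru_victim(const struct nru_set *s)
{
    for (unsigned i = 0; i < WAYS; i++)       /* find-first-unset scan */
        if (!s->used[i])
            return i;
    return 0;                                 /* cannot happen after nru_touch() */
}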
For Intel CPUs, the replacement policies are usually undocumented. I have done some experiments to uncover the policies in recent Intel CPUs, the results of which can be found on https://uops.info/cache.html. The code that I used is available on GitHub.
The following is a summary of my findings.
Tree-PLRU: This policy is used by the L1 data caches of all CPUs that I tested, as well as by the L2 caches of the Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell CPUs.
Randomized Tree-PLRU: Some Core 2 Duo CPUs use variants of Tree-PLRU in their L2 caches where either the lowest or the highest bits in the tree are replaced by (pseudo-)randomness.
MRU: This policy is sometimes also called NRU. It uses one bit per cache block. An access to a block sets the bit to 0. If the last 1-bit was set to 0, all other bits are set to 1. Upon a miss, the first block with its bit set to 1 is replaced. This policy is used for the L3 caches of the Nehalem, Westmere, and Sandy Bridge CPUs.
Quad-Age LRU (QLRU): This is a generalization of the MRU policy that uses two bits per cache block. Different variants of this policy are used for the L3 caches, starting with Ivy Bridge, and for the L2 caches, starting with Skylake.
Adaptive policies: The Ivy Bridge, Haswell, and Broadwell CPUs can dynamically choose between two different QLRU variants. This is implemented via set dueling: A small number of dedicated sets always use the same QLRU variant; the remaining sets are "follower sets" that use the variant that performs better on the dedicated sets. See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.

Which of a misaligned store and misaligned load is more expensive?

Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
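To illustrate the shift-and-OR merge that a misaligned read implies, here is a hedged sketch that composes an unaligned 32-bit load from two aligned loads (my own example; it assumes a little-endian machine and glosses over strict-aliasing rules):

#include <stdint.h>

uint32_t load32_unaligned(const uint8_t *p)
{
    uintptr_t addr   = (uintptr_t)p;
    unsigned  offset = (unsigned)(addr & 3);           /* byte offset within a word */
    const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);

    if (offset == 0)
        return w[0];                                   /* already aligned: one read */

    uint32_t lo = w[0];                                /* aligned read #1 */
    uint32_t hi = w[1];                                /* aligned read #2 */
    return (lo >> (8 * offset)) | (hi << (8 * (4 - offset)));
}

A misaligned store would need the same two reads plus two writes to merge the new bytes back in, which is why the read comes out ahead.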
Actually that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store 16-byte or 32-byte unaligned chunks may you see a small performance degradation.
How much data? Are we talking about two things unaligned at the ends of a large block of data (in the noise), or one item (a word, etc.) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance
difference, if any.
Memories, modules, chips, on-die blocks, etc. are usually organized with a fixed access size; at least somewhere along the way there is a fixed access size. Let's just say 64 bits wide, not an uncommon size these days. So at that layer, wherever it is, you can only write or read in aligned 64-bit units.
If you think about a write vs. a read: with a read you send out an address, that has to go to the memory, and the data comes back; a full round trip has to happen. With a write, everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire-and-forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not yet reached the memory. It does take time, but not as long as a read (not talking about flash/PROMs, just RAM here), since a read requires both paths. So for aligned, full-width stuff a write CAN BE faster; some systems may wait for the data to make it all the way to the memory and then return a completion, which is perhaps about the same amount of time as the read. It depends on your system, though; the memory technology can make one or the other faster or slower right at the memory itself. Now the first write after nothing has been happening can do this fire-and-forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path, and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads, but for large movements of data they approach each other.
Now alignment. The whole memory width will be read on a read, in this case let's say 64 bits; if you are only really interested in 8 of those bits, then somewhere between the memory and the processor the other 56 bits are discarded (where depends on the system). Writes that are not a whole, aligned memory width mean that you have to read the width of the memory, let's say 64 bits, modify the new bits, say 8 bits, then write the whole 64 bits back: a read-modify-write. A read only needs a read; a write needs a read-modify-write, and the farther away from the memory the read-modify-write has to happen, the longer it takes and the slower it is. No matter what, the read-modify-write can't be any faster than the read alone, so the read will be faster. The trimming of bits off the read generally doesn't take any time, so reading a byte compared to reading 16, 32 or 64 bits from the same location, so long as the busses and destination are that width all the way, takes the same time, in general, or should.
Unaligned accesses simply multiply the problem. Say, worst case, you want to read 16 bits such that 8 bits are in one 64-bit location and the other 8 in the next 64-bit location; you need to read 128 bits to satisfy that 16-bit read. How exactly that happens, and how much of a penalty it is, depends on your system. Some busses set up the transfer in X clocks but the data is one clock per bus width after that, so a 128-bit read might be only one clock longer (than the dozens to hundreds of clocks it takes to read 64 bits), or worst case it could take twice as long in order to get the 128 bits needed for this 16-bit read. A write is a read-modify-write, so take the read time, then modify the two 64-bit items, then write them back; same deal, it could be X+1 clocks in each direction or could be as bad as 2X clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory; you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64-bit writes, etc. How that happens, though, is that the cache will perform same-sized or larger reads. So reading 8 bits may result in one or more 64-bit reads of the slow memory for the first byte; if you perform a second read right after that of the next byte location, and that location is in the same cache line, then it doesn't go out to slow memory, it reads from the cache, which is much faster, and so on until you cross over into another cache line or other reads cause that cache line to be evicted. If the location being written is in the cache then the read-modify-write happens in the cache; if not in the cache then it depends on the system: a write doesn't necessarily mean the read-modify-write causes a cache line fill, it could happen on the back side as if the cache were not there. Now if you modified one byte in the cache line, that line has to be written back, it simply cannot be discarded, so you have one to a few widths of the memory to write back as a result. Your modification was fast, but eventually the write happens to the slow memory and that affects the overall performance.
You could have situations where you do a (byte) read and the cache line, if bigger than the external memory width, makes that read slower than if the cache weren't there; but then you do a byte write to some item in that cache line and that is fast since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading, say, 16 bits unaligned such that not only do they cross over a 64-bit memory width boundary but they also cross over a cache line boundary, so that two cache lines have to be read; instead of reading 128 bits, that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks in your computer, for example, are actually multiple memories: say maybe eight 8-bit-wide chips to make a 64-bit overall width, or sixteen 4-bit-wide chips to make an overall 64-bit width, etc. That doesn't necessarily mean you can isolate writes to one lane; I don't know those modules very well, but there are systems where you can/could do this, and those systems I would consider to be 8 or 4 bits wide as far as the smallest addressable size goes, not 64 bits, as far as this discussion is concerned. ECC makes things worse though. First you need an extra memory chip or more, basically more width, 72 bits to support 64 for example. You must do full writes with ECC, as the whole 72 bits, let's say, has to be self-checking, so you can't do fractions. If there is a correctable (single-bit) error the read suffers no real penalty: it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want a system to write back that corrected value, but that is not how all systems work, so a read could turn into a read-modify-write, aligned or not. The primary penalty is that if you were able to do fractional writes before, you can't now: with ECC they have to be whole-width writes.
Now to my question: let's say you use memcpy to move this data. Many C libraries are tuned to do aligned transfers, at least where possible; if the source and destination are unaligned in different ways that can be bad, and you might want to manage part of the copy yourself. Say they are unaligned in the same way: the memcpy will try to copy the unaligned bytes first until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, then it downshifts and copies the last few bytes, if any, in an unaligned fashion. So if the memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends, then yes, it will cost you some extra reads, as much as two extra cache line fills, but that may be in the noise. Even at smaller sizes, even if aligned on, say, 32-bit boundaries, if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved; aligned or not, you might only suffer an extra cache line's worth of reading and later writing...
The pure traditional, non-cached memory view of this, all other things held constant, is as Doug wrote. An unaligned read across one of these boundaries, like the 16 bits across two 64-bit words, costs you an extra read: 2R vs. 1R. A similar write costs you 2R+2W vs. 1W, much more expensive. Caches and other things just complicate the problem greatly, making the answer "it depends"... You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt; with any cache a test can be crafted to show the cache makes things slower, and on the same system a test can be written to show the cache makes things faster.
For further reading, go look at the data books/sheets, technical reference manuals, or whatever the vendor calls their docs for various things. For ARM, get the AXI/AMBA documentation on their busses and the documentation for their cache (the PL310 for example). Information on DDR memory and the individual chips used in the modules you plug into your computer is all out there, with lots of timing diagrams, etc. (Note: just because you think you are buying gigahertz memory, you are not; DRAM has not gotten faster in 10 years or more, it is pretty slow, around 133 MHz; it is just that the bus is faster and can queue more transfers. It still takes hundreds to thousands of processor cycles for a DDR memory cycle, and a read of one byte that misses all the caches makes your processor wait an eternity.) So memory interfaces on the processors and docs on various memories, etc., may help, along with textbooks on caches in general, etc.

How does one write code that best utilizes the CPU cache to improve performance?

This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How do I make code cache-effective/cache-friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache and program cache (instruction cache),
i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache-effective?
Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of a structure, etc., to make code cache-effective?
Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences related to making cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.
The variety will help to better to understand it deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).
Techniques for avoiding memory fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of those fail to use 100% of the fetched cache lines before the lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelihood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way; see the sketch after this list)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
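For example, here is a hedged sketch of the member-reordering point above (sizes assume a typical 64-bit ABI; the struct names are made up):

#include <stdio.h>

struct padded {                /* holes after flag and id */
    char   flag;               /* 1 byte + 7 bytes padding */
    double value;              /* 8 bytes */
    short  id;                 /* 2 bytes + 6 bytes padding */
    double weight;             /* 8 bytes */
};                             /* typically 32 bytes */

struct sorted_by_size {        /* largest members first */
    double value;
    double weight;
    short  id;
    char   flag;               /* only tail padding remains */
};                             /* typically 24 bytes */

int main(void)
{
    printf("padded: %zu bytes, sorted: %zu bytes\n",
           sizeof(struct padded), sizeof(struct sorted_by_size));
    return 0;
}

Fewer bytes per element means more elements fit in each fetched cache line.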
We should also note that there are other ways to hide memory latency than using caches.
Modern CPUs often have one or more hardware prefetchers. They train on misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the HW prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
By regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory-level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
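To make the tiling/blocking idea concrete, here is a minimal sketch (my own example: square N x N matrices, and a block size B that divides N and is chosen so a B x B tile of each matrix fits in cache):

#define N 1024
#define B 32

void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            /* work on one B x B tile so the touched parts of src and dst
             * stay cache-resident until the tile is finished */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}

The naive two-loop transpose touches dst in strides of N doubles and evicts lines long before their remaining elements are needed; the blocked version reuses each fetched line within the tile.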
These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        do_something(ary[j][i]);   /* inner loop strides down a column */
The reason this is cache-inefficient is that modern CPUs will load a cache line of "near" memory addresses from main memory when you access a single memory address. Because the inner loop varies "j", the first (row) subscript, each access to the [j][i] entry jumps a whole row ahead in memory, so (nearly) every trip through the inner loop loads a fresh cache line rather than reusing one that is already loaded. If this is changed to the equivalent:
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        do_something(ary[i][j]);   /* inner loop walks along a row, sequentially */
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accommodate this by remembering the most recently used chunks of data. It operates with cache lines, typically sized 128 bytes or so, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
The most effective data structure for a cache is an array. Caches work best if your data structure is laid out sequentially, as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order trashes the caches because it always needs new cache lines to accommodate the randomly accessed memory. On the other hand, an algorithm which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
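A minimal sketch of that array-of-structs to struct-of-arrays move (field names and sizes are illustrative, not from any particular engine):

enum { MAX_OBJECTS = 4096 };

/* Physics-hot fields live in their own parallel arrays, indexed by object id. */
struct physics_soa {
    float pos_x[MAX_OBJECTS], pos_y[MAX_OBJECTS], pos_z[MAX_OBJECTS];
    float vel_x[MAX_OBJECTS], vel_y[MAX_OBJECTS], vel_z[MAX_OBJECTS];
    float mass[MAX_OBJECTS];
};

void integrate(struct physics_soa *p, int count, float dt)
{
    /* streams through each array sequentially: cache- and SIMD-friendly */
    for (int i = 0; i < count; i++) {
        p->pos_x[i] += p->vel_x[i] * dt;
        p->pos_y[i] += p->vel_y[i] * dt;
        p->pos_z[i] += p->vel_z[i] * dt;
    }
}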
A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)
I wanted to check the time differences for the two iteration orders ("outer" and "inner"), so I made a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
    for ( int x = 0; x < N; ++x )
        sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version ("x first") took 0.88 sec and the faster one took 0.06 sec. That's the power of caching :)
I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by themselves" does not hold.
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
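A minimal sketch of that align-and-pad idea in C11 (it assumes a 64-byte line; the names are made up):

#include <stdalign.h>

enum { NUM_THREADS = 8 };

/* Each counter gets its own cache line, so one thread's updates never
 * invalidate the line holding another thread's counter. */
struct padded_counter {
    alignas(64) unsigned long value;    /* struct is padded out to 64 bytes */
};

static struct padded_counter hits[NUM_THREADS];

void record_hit(int thread_id)
{
    hits[thread_id].value++;            /* no false sharing between threads */
}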
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
To ask how to make code cache-effective/cache-friendly (and most of the other questions here) is usually to ask how to optimize a program; that's because the cache has such a huge impact on performance that any optimized program is one that is cache-effective/cache-friendly.
I suggest reading about Optimization, there are some good answers on this site.
In terms of books, I recommend Computer Systems: A Programmer's Perspective, which has some fine text about the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There have been a lot of answers with general advice like data structure selection, access patterns, etc. Here I would like to add another code design pattern, called the software pipeline, that makes use of active cache management.
The idea is borrowed from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern best applies to procedures that
can be broken down into multiple reasonable sub-steps, S[1], S[2], S[3], ... whose execution time is roughly comparable with RAM access time (~60-70 ns), and
take a batch of inputs and perform the aforementioned steps on them to get results.
Let's take a simple case where there is only one sub-procedure.
Normally the code would look like:
def proc(input):
    return sub_step(input)
To have better performance, you might want to pass multiple inputs to the function in a batch, so you amortize the function call overhead and also increase code cache locality.
def batch_proc(inputs):
    results = []
    for i in inputs:
        # avoids code cache misses, but still suffers data (inputs) cache misses
        results.append(sub_step(i))
    return results
However, as said earlier, if the execution of the step is roughly the same as RAM access time you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
    results = []
    for i in range(0, len(inputs) - 1):
        prefetch(inputs[i + 1])
        # work on the current item while inputs[i+1] is flying back from RAM
        results.append(sub_step(inputs[i]))
    results.append(sub_step(inputs[-1]))
    return results
The execution flow would look like:
prefetch(1) asks the CPU to prefetch inputs[1] into the cache; the prefetch instruction itself takes P cycles and returns, and in the background inputs[1] arrives in the cache after R cycles.
works_on(0) takes a cold miss on inputs[0] and works on it, which takes M cycles.
prefetch(2) issues another prefetch.
works_on(1): if P + R <= M, then inputs[1] should already be in the cache before this step, thus avoiding a data cache miss.
works_on(2) ...
There could be more steps involved, and you can design a multi-stage pipeline; as long as the timing of the steps and the memory access latency match, you will suffer few code/data cache misses. However, this process needs to be tuned with many experiments to find the right grouping of steps and prefetch distance. Due to the required effort, it sees more adoption in high-performance data/packet stream processing. A good production code example can be found in the DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information can be found at:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
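For C code, the same prefetch-ahead pattern can be sketched with the GCC/Clang __builtin_prefetch intrinsic (struct item and process_item here are hypothetical stand-ins for the real per-item work):

struct item { double payload[8]; };

static void process_item(struct item *it) { it->payload[0] += 1.0; }

void batch_pipelined(struct item *items, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)                             /* ask for the next item early */
            __builtin_prefetch(&items[i + 1], 0 /* read */, 3 /* keep in cache */);
        process_item(&items[i]);                   /* overlaps the prefetch in flight */
    }
}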
Besides aligning your structure and fields, if your structure is heap-allocated you may want to use allocators that support aligned allocations, like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may get random false sharing. Remember that in Windows, the default heap has 16-byte alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations with GCC: it produces a larger size. Often, -Os is just as good as -O2. It all depends on the processor used, though. YMMV.
Work with small chunks of data at a time. That is why a less efficient sorting algorithm can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted into assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different code even though they are merely traversing an array. In any case, your question is very architecture-specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.

Resources