Reducing calls to main memory, given heap-allocated objects - caching

The OP here mentions in the final post (4th or so para from bottom):
"Now one thing that always bothered me about this is all the child
pointer checking. There are usually a lot of null pointers, and
waiting on memory access to fill the cache with zeros just seems
stupid. Over time I added a byte that contains a 1 or 0 to tell if
each of the pointers is NULL. At first this just reduced the cache
waste. However, I've managed cram 9 distance comparisons, 8 pointer
bits, and 3 direction bits all through some tables and logic to
generate a single switch variable that allows the cases to skip the
pointer checks and only call the relevant children directly. It is in
fact faster than the above, but a lot harder to explain if you haven't
seen this version."
He is referring to octrees as the data structure for real-time volume rendering. These would be allocated on the heap, due to their size. What I am trying to figure out is:
(a) Are his assumptions in terms of waiting on memory access, valid? My understanding is that he's referring to waiting on a full run out to main memory to fetch data, since he's assuming it won't be found in the cache due to generally not-too-good locality of reference when using dynamically-allocated octrees (common for this data structure in this sort of application).
(b) Should (a) prove to be true, I am trying to figure out how this workaround
Over time I added a byte that contains a 1 or 0 to tell if each of the
pointers is NULL.
would be implemented without still using the heap, and thus still incurring the same overhead, since I assume it would need to be stored in the octree node.

(a) Yes, his concerns about memory wait time are valid. In this case, he seems to be worried about the size of the node itself in memory; just the children take up 8 pointers, which is 64 bytes on a 64-bit architecture, or one cache line just for the children.
(b) That bitfield is stored in the node itself, but now takes up only 1 byte (1 bit for 8 pointers). It's not clear to me that this is an advantage though, as the line(s) containing the children will get loaded anyway when they are searched. However, he's apparently doing some bit tricks that allow him to determine which children to search with very few branches, which may increase performance. I wish he had some benchmarks that would show the benefit.

Related

Understanding Cache line invalidation and striped locks for Concurrent HashMap Implementation

If we put the striped locks very close to each other in memory for a concurrent hashmap, the cache line size can affect performance because we would have to invalidate caches unnecessarily. If you add padding to the array of striped locks, it will improve performance.
Can someone explain this?
To start with a non-concurrent hashmap, the basic principle is this:
Have a indexed structure (most often an array or set of arrays) for the keys and values.
Get a hash for the key.
Reduce this to be within the size of the arrays. (Modulo does this simply enough, so if the hash value is 123439281 and there are 31 slots available, then we use 123439281 % 31 which is 9 and use that as our index).
See if there's a key there, and if so if it matches (equals).
Store the key if it's new, and the value.
The same approach can be used to find the value for a given key (or to find that there is none).
Of course the above doesn't work if there's a key in the same slot that isn't equal to the key you're concerned with, and there are different approaches to dealing with this, mainly either continuing to look in different slots until one is free, or having slots actually act as a linked list of equal-indexed pairs. I won't go into the details of this.
If you are looking to other slots it won't work once you've filled the arrays (and will be slow before that point) and if you are using linked-lists to handle collisions you will be very slow if you have many keys at the same index (the O(1) you want becomes closer and closer to O(n) as this gets worse). Either way you're going to want a mechanism to resize the internal store when the amount stored gets too large.
Okay. That's a very high-level description of a hashmap. What if we want to make it threadsafe?
It won't be threadsafe by default as e.g. two different threads writing different keys whose hash modulo down to the same value, then one thread might stomp over the other.
The simplest way to make a hashmap threadsafe is to simply have a lock that we use on all operations. That means that every thread will wait on every other thread though, so it won't have very good concurrent behaviour. With a bit of clever structuring it's not too hard to have it that we can have multiple reading threads or a single writing thread (but not both), but that still isn't great.
It's possible to create hashmaps that are safely concurrent without using locks at all (see http://www.communicraft.com/blog/details/a-lock-free-dictionary for a description of one I wrote in C#, or http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable for one in Java by Cliff Click whose basic approach I used in my C# version).
But another approach is striped locks.
Because the basis of the map is either an array for key-value pairs (and likely a cached copy of the hashcode) or a set of arrays for them, and because it is generally safe to have two threads writing and/or reading to different parts of an array at a time (there are caveats, but I'll ignore them for now) the only problems are when either two threads want the same slot, or when a resize is necessary.
And therefore the different slots could have different locks, and then only threads that are operating on the same slot would need to wait on each other.
There'd still be the problem of resizing, but that isn't insurmountable; if you need to resize obtain every one of the locks (in a set order, so that you can prevent deadlocks from happening) and then do the resize, having first checked that no other thread did the same resize in the meantime.
However, if you've a hashmap with 10,000 slots this would mean 10,000 lock objects. That's a lot of memory used, and also a resize would mean obtaining every one of those 10,000 locks.
Striped locks are somewhere in-between the single-lock approach and the lock-per-slot approach. Have an array of a certain number of locks, say 16 as a nice (binary) round number. When you need to act on a slot then obtain lock number slotIndex % 16, and do your operation. Now while threads may still end up blocking on threads doing operations on completely different slots (slot 5 and slot 21 have the same lock) they can still act concurrently to many other operations, so it's a middle-ground between the two extremes.
So that's how striped locking works, at a high level.
Now, modern day memory access is not uniform, in that it does not take the same time to access arbitrary pieces of memory because there is a level of caching (generally at least 2 levels) in the CPU. This caching has both good and bad effects.
Obviously the good effects normally outweigh the bad, or chip manufacturers wouldn't use it. If you access a piece of memory, and then access a piece of memory very close to it, the chances are that second access will be very fast because it will have been loaded into the cache on the first read. Writes are also improved.
It's already natural enough that a given piece of code is likely to want to do several operations on blocks of memory close to each other (e.g. reading two fields in the same object, or two locals in a method), which is why this sort of caching worked in the first place. And programmers further work to take advantage of this fact as much as possible in how they design their code, and collections such as hashmaps are a classic example. E.g. we might have stored keys and stored hashes in the same array so that reading one brings the other into the cache to be quickly read, and so on.
There are though times when this caching has a negative effect. In particular if two threads are going to deal with bits of memory that are close to each other at around the same time.
This doesn't come up that often, because threads are most often dealing with their own stacks or bits of heap memory pointed to by their own stacks, and only occasionally heap memory that is visible to other threads. That in itself is a big part of why CPU caches are normally a big win for performance.
However, the use of concurrent hashmaps is inherently a case where multiple threads hit neighbouring blocks of memory.
CPU caches work on the basis of "cache lines". These are blocks of code that are loaded into the cache from the RAM, or written from the cache to the RAM as a unit. (Again, while we're about to discuss a case where this is a bad thing, this is an efficient model most of the time).
Now, consider a 64-bit processor with 64-byte cache-lines. Every pointer or reference to an object is going to take up 8 bytes. If a piece of code tries to access such a reference it will mean that 64 bytes are loaded into the cache, then 8 bytes of that dealt with by the CPU. If the CPU writes to that memory, then those 8 bytes are changed on the cache, and the cache written back to the RAM. As said, this is generally good, because the odds are high that we'll also want to do the same with other bits of RAM nearby, and hence in the same cache line.
But what if another thread wants to hit the same block of memory?
If CPU0 goes to read from a value that is in the same cachline that CPU1 has just written to, it will have a stale cacheline that has been invalidated and have to read it again. If CPU0 was trying to write to it it may well not only have to read it again, but redo the operation that gave it the result to write.
Now, if that other thread had wanted to hit the exact same bit of memory, there'd have been a conflict even without caching, so things aren't that much worse than they would have been (but they are worse). But if the other thread was going to hit nearby memory it will still suffer.
This is obviously bad for our concurrent map's slots, but its even worse for its striped locks. We'd said we might have 16 locks. With 64-byte cachelines and 64-bit references that's 2 cachelines for all the locks. The odds a lock is in the same cacheline as that wanted by the other thread is 50%. With 128-byte cachelines (Itanium has those) or 32-bit references (all 32-bit code uses those) it's 100%. With lots of threads its effectively 100% that you're going to be waiting. And waiting again if there's yet another hit. And waiting.
Our attempt to prevent threads waiting on the same lock has turned into them waiting on the same cacheline.
Worse, the more cores you have using the locks, the worse this becomes. Each extra core slows down the total throughput roughly exponentially. 8 cores might take over 200 times as long to execute as 1 core would!
If however we pad out our striped locks with blank space so that there is a 56-byte gap between each one, then this doesn't happen; the locks are all on different cachelines, and operations on neighbouring locks don't affect it any more. This costs memory, and makes normal reading and writing slower (the point of caches is that it makes things faster most of the time after all), but is appropriate in cases where particularly frequent concurrent access is expected, and we're not likely to want to hit the next lock (we aren't, except for resize operations). (Another example would be striped counters; have different threads increment different integers and sum them when you want to get the tally).
This problem of threads hitting neighbouring pieces of memory (called "false-sharing" because it has a performance impact caused by shared access to the same memory even though they are actually accessing neighbouring memory rather than the same memory) will also affect the internal storage of the hashmap itself, but not as much because the map itself is likely larger and so the odds of two accesses hitting the same cacheline is less. It would also be more expensive to use padding here for the same reason; being larger the amount of padding that would involve could be huge.

Heap Type Implementation

I was implementing a heap sort and I start wondering about the different implementations of heaps. When you don need to access the elements by index(like in a heap sort) what are the pros and cons of implementing a heap with an array or doing it like any other linked data structure.
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When I should use each one and why?
As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- your values in the heap can always be pointers to the larger structures. This may afford for better cache localization on the heap itself, but you're still going to have to go out someplace to memory for extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4 byte float or integer) you can store that as the key with a pointer to the full data and achieve good cache coherency.
Heap sorts are already not particularly good on cache hits throughout traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. However, this still isn't so bad, even with arrays. Anymore, in non-embedded environments with nice, pretty memory systems growing an array with some calls (e.g. realloc, please forgive my C background) really isn't all that slow because the data may not need to physically move in memory -- just some address pointer magic for most of it. Added to the fact that if you use a array-size-doubling strategy (array is too small, double the size in a realloc call) you're still ending up with an O(n) amortized cost with relatively few reallocs and at most double wasted space -- but hey, you'd get that with linked lists anyways if you're using a 32-bit key and 32-bit pointer.
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I don't need anymore with a single deallocation. However, it's easier to read pointer-based code for heaps in my opinion since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work but I don't know how that can be easily switched to dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This, way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object, however there is one drawback: you can end-up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects. Which effectively allows to merge pages.
(*) BiBOP: Big Bag Of Pages
I would recommend making customized file-system (might contain one file of course), based on FUSE. There are a lot of available solutions for FUSE you can base on - I recommend choosing not related but simplest projects, in order to learn easily.
What algorithm and data-structure to choose, it highly deepens on your needs. It can be : map, list or file split into chunks with on-the-fly compression/decompression.
Data structures proposed by you are good ideas. As you clearly see there is a trade-off: fragmentation vs compaction.
On one side - best compaction, highest fragmentation - splay and many other kinds of trees.
On another side - lowest fragmentation, worst compaction - linked list.
In between there are B-Trees and others.
As you I understand, you stated as priority: space-saving - while taking care about performance.
I would recommend you mixed data-structure in order to achieve all requirements.
a kind of list of contiguous blocks of data
a kind of tree for current "add/remove" operation
when data are required on demand, allocate from tree. When deleted, keep track what's "deleted" using tree as well.
mixing -> during each operation (or on idle moments) do "step by step" de-fragmentation, and apply changes kept in tree to contiguous blocks, while moving them slowly.
This solution gives you fast response on demand, while "optimising" stuff while it's is used, (For example "each read of 10MB of data -> defragmantation of 1MB) or in idle moments.
The most simple solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.

Searching for membership in array of ranges

As part of our system simulation, I'm modeling a memory space with 64-bit addressing using a sparse memory array and keeping a list of objects to keep track of buffers that are allocated within the memory space. Buffers are allocated and de-allocated dynamically.
I have a function that searches for a given address or address range within the allocated buffers to see if accesses to the memory model are in allocated space or not, and my first cut "search through all the buffers until you find a match" is slowing down our simulations by 10%. Our UUT does a lot of memory accesses that have to be vetted by the simulation.
So, I'm trying to optimize. The memory buffer objects contain a starting address and a length. I'm thinking about sorting the object array by starting address at object creation, and then, when the checking function is called, doing a binary search through the array looking to see if a given address falls within a start/end range.
Are there any better/faster ways to do this? There must be some faster/cooler algorithm out there using heaps or hash signatures or some-such, right?
Binary search through a sorted array works but makes allocation/deallocation slow.
A simple case is to make an ordered binary tree (red-black tree, AVR tree, etc.) indexed by the starting address, so that insertion (allocation), removal (deallocation) and searching are all O(log n). Most modern languages provide such data structure (e.g. C++'s std::map) already.
My first thought was also binary search and I think that it is a good idea. You should be able to insert and remove quickly too. Using a hash would just make you put the addresses in buckets (in my opinion) and then you'd get to the right bucket quickly (and then have to search through the bucket).
Basically your problem is that you have a defined intervals of "valid" memory, memory outside those intervals is "invalid", and you want to check for a given address whether it is inside a valid memory block or not.
You can definitely do this by storing the start addresses of all allocated blocks in a binary tree; then search for the largest address at or below the queried address, and just verify that the address falls within the length of the valid address. This gives you O(log n) query time where n = number of allocated blocks. The same query of course can be used also to actually the find the block itself, so you can also read the contents of the block at the given address, which I guess you'd need also.
However, this is not the most efficient scheme. Instead, you could use additionally one-dimensional spatial subdivision trees to mark invalid memory areas. For example, use a tree with branching factor of 256 (corresponding to 8 bits) that maps all those 16kB blocks that have only invalid addresses inside them to "1" and others to "0"; the tree will have only two levels and will be very efficient to query. When you see an address, first ask form this tree if it's certainly invalid; only when it's not, query the other one. This will speed things up ONLY IF YOU ACTUALLY GET LOTS OF INVALID MEMORY REFERENCES; if all the memory references are actually valid and you're just asserting, you won't save anything. But you can flip this idea also around and use the tree mark to all those 16kB or 256B blocks that have only valid addresses inside them; how big the tree grows depends on how your simulated memory allocator works.

How does one write code that best utilizes the CPU cache to improve performance?

This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How to make code, cache effective/cache friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache & program cache (instruction cache),
i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective.
Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of that structure etc... to make code cache effective.
Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences related to making cache efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux,S ymbian, ...), etc..
The variety will help to better to understand it deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transfered (preserving memory bandwidth).
Techniques for avoiding suffering from memory fetch latency is typically the first thing to consider, and sometimes helps a long way. The limited memory bandwidth is also a limiting factor, particularly for multicores and multithreaded applications where many threads wants to use the memory bus. A different set of techniques help addressing the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to a cache. When we have looked at various standard benchmarks, we have seen that a surprising large fraction of those fail to use 100% of the fetched cache lines before the cache lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelyhood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
avoid algorithms and datastructures that exhibit irregular access patterns, and favor linear datastructures.
We should also note that there are other ways to hide memory latency than using caches.
Modern CPU:s often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hw prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher is usually doing a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
Regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustain one latency hit (Memory level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
These things are what really pays off in the multicore world, where you typically wont see much of throughput improvements after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
pseudocode
for (i = 0 to size)
for (j = 0 to size)
do something with ary[j][i]
The reason this is cache inefficient is because modern CPUs will load the cache line with "near" memory addresses from main memory when you access a single memory address. We are iterating through the "j" (outer) rows in the array in the inner loop, so for each trip through the inner loop, the cache line will cause to be flushed and loaded with a line of addresses that are near to the [j][i] entry. If this is changed to the equivalent:
for (i = 0 to size)
for (j = 0 to size)
do something with ary[i][j]
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accomodate this by remembering the most recently used chunks of data. It operates with cache lines, typically sized 128 byte or so, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
The most effective data structure for a cache is an array. Caches work best, if your data structure is laid out sequentially as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order trashes the caches because it always needs new cache lines to accomodate the randomly accessed memory. On the other hand an algorithm, which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)
I wanted to check the time differences for two iteration orders ( "outter" and "inner"), so I made a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
for ( int x = 0; x < N; ++x )
sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version ("x first") was 0.88sec and the faster one, was 0.06sec. That's the power of caching :)
I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by itselves" does not hold
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
To ask how to make a code, cache effective-cache friendly and most of the other questions , is usually to ask how to Optimize a program, that's because the cache has such a huge impact on performances that any optimized program is one that is cache effective-cache friendly.
I suggest reading about Optimization, there are some good answers on this site.
In terms of books, I recommend on Computer Systems: A Programmer's Perspective which has some fine text about the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There has been a lot of answers on general advices like data structure selection, access pattern, etc. Here I would like to add another code design pattern called software pipeline that makes use of active cache management.
The idea is borrow from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern best applies to procedures that
could be broken down to reasonable multiple sub-steps, S[1], S[2], S[3], ... whose execution time is roughly comparable with RAM access time (~60-70ns).
takes a batch of input and do aforementioned multiple steps on them to get result.
Let's take a simple case where there is only one sub-procedure.
Normally the code would like:
def proc(input):
return sub-step(input))
To have better performance, you might want to pass multiple inputs to the function in a batch so you amortize function call overhead and also increases code cache locality.
def batch_proc(inputs):
results = []
for i in inputs:
// avoids code cache miss, but still suffer data(inputs) miss
results.append(sub-step(i))
return res
However, as said earlier, if the execution of the step is roughly the same as RAM access time you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
for i in range(0, len(inputs)-1):
prefetch(inputs[i+1])
# work on current item while [i+1] is flying back from RAM
results.append(sub-step(inputs[i-1]))
results.append(sub-step(inputs[-1]))
The execution flow would look like:
prefetch(1) ask CPU to prefetch input[1] into cache, where prefetch instruction takes P cycles itself and return, and in the background input[1] would arrive in cache after R cycles.
works_on(0) cold miss on 0 and works on it, which takes M
prefetch(2) issue another fetch
works_on(1) if P + R <= M, then inputs[1] should be in the cache already before this step, thus avoid a data cache miss
works_on(2) ...
There could be more steps involved, then you can design a multi-stage pipeline as long as the timing of the steps and memory access latency matches, you would suffer little code/data cache miss. However, this process needs to be tuned with many experiments to find out right grouping of steps and prefetch time. Due to its required effort, it sees more adoption in high performance data/packet stream processing. A good production code example could be found in DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information could be found:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
Besides aligning your structure and fields, if your structure if heap allocated you may want to use allocators that support aligned allocations; like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may have random false sharing; remember that in Windows, the default heap has a 16 bytes alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC. It takes up a larger size. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.
Work with small chunks of data at a time. That is why a less efficient sorting algorithms can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted in to assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different codes even though they are merely parsing through an array. In any case, your question is very architecture specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.

Resources