optimal memory layout for read-only/write memory segments - performance

Suppose I have two memory segments (equal size, approximately 1 KB each); one is read-only (after initialization) and the other is read/write.
What is the best layout in memory for such segments in terms of memory performance: one allocation with contiguous segments, or two allocations (in general not contiguous)? My primary architecture is Linux, Intel 64-bit.
My feeling is that the former (cache-friendlier) layout is better.
Are there circumstances where the second layout is preferred?

I would put the 2 KB of data in the middle of a 4 KB page, to avoid interference from reads and writes close to the page boundary. Similarly, keeping the writable data separate is a good idea for the same reason.
Having contiguous read/write blocks may be less efficient than keeping them separate. For example, a cache that is storing data for code interested in just the read-only portion may be invalidated by a write from another CPU. The cache line will be invalidated and refreshed even though the code wasn't reading the writable data. By keeping the blocks separate, you avoid this case: writes to the writable block only invalidate cache lines for the writable block and do not interfere with cache lines for the read-only block.
Note that this is only a concern at the boundary between the read-only and writable blocks. If your blocks were much larger than the cache line size, this would be a peripheral problem, but since your blocks are small, requiring just a few cache lines, the problem of invalidated lines could be significant.
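As a minimal C++ sketch of the "keep them separate" layout (the 64-byte line size, the helper names, and the use of std::aligned_alloc are my assumptions, not part of the question):

#include <cstdlib>

constexpr std::size_t kCacheLine = 64;   // assumed x86-64 cache-line size
constexpr std::size_t kSegSize   = 1024; // ~1 KB per segment, as in the question

struct Segments {
    void* ro; // read-only after initialization
    void* rw; // read/write
};

// Two independent, cache-line-aligned allocations: a write into rw can never
// dirty a cache line that also holds part of ro.
Segments allocate_segments() {
    Segments s;
    s.ro = std::aligned_alloc(kCacheLine, kSegSize); // size is a multiple of the alignment
    s.rw = std::aligned_alloc(kCacheLine, kSegSize);
    return s;
}

If you prefer the single-allocation layout, one std::aligned_alloc(kCacheLine, 2 * kSegSize) with the read-only half placed first keeps each half line-aligned while staying contiguous.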

With data that small, it really shouldn't matter much. Both of those segments will fit into any level of cache just fine.

It'll depend on what you're doing with the memory. I'm fairly certain that contiguous (and page aligned!) would never be slower than two randomly placed segments, but it won't necessarily be any faster.

Given that it's an Intel processor, you probably only need to ensure that the addresses are not exactly a multiple of 64k apart. If they are, loads from either section that map to the same modulo 64k address will collide in L1 and cause an L1 miss. There's also a 4MB aliasing issue, but I'd be surprised if you ran into that.

Related

How does cache associativity impact performance [duplicate]

(Marked as a duplicate of "Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?")
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536) -- explanation how cache associativity impacts performance. In a test author used 3 square arrays 1023x1023, 1024x1024, 1025x1025 of ints and observed that accessing first column was slower for 1024x1024 case.
Author explained (background info, CPU is Intel with L1 cache with 32KB memory, it is 8-way associative):
When N=1024, this difference is exactly 4096 bytes; it equals the
critical stride value. This means that all elements from the first
column match the same eight cache lines of L1. We don’t really have
performance benefits from the cache because we can’t use it
efficiently: we have only 512 bytes (8 cache lines * 64-byte cache
line size) instead of the original 32 kilobytes. When we iterate the
first column in a loop, the corresponding elements pop each other from
the cache. When N=1023 and N=1025, we don’t have problems with the
critical stride anymore: all elements can be kept in the cache, which
is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache, just because the main memory cannot be mapped to the full cache.
It strikes me as odd; after reading the wiki page, I would say the performance penalty comes from resolving address conflicts. Since each row can potentially be mapped onto the same cache lines, it is conflict after conflict, and the CPU has to resolve those, which takes time.
Thus my question: what is the real nature of the performance problem here? Is the accessible cache size effectively lower, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.
The 1024 example hits a bad case. For that entire column all words requested from RAM land in the same cell in cache (or the same 2 cells, if using a 2-way associative cache, etc).
Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.
Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row is finished, the next row starts. With a little thought, you can see that consecutive cells in a column have addresses differing by 1024*w, where w = 4 or 8 (or whatever the size of a cell). That is a power of 2.
Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".
Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way, etc., etc.)
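As a toy illustration of "picking a few bits out of the address", here is the set-index computation for the L1 described in the question (32 KB, 8-way, 64-byte lines, hence 64 sets); the code is only a sketch of the mapping, not of real hardware:

#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kNumSets   = 32 * 1024 / (8 * 64); // = 64 sets

std::uint64_t l1_set(std::uint64_t addr) {
    return (addr / kLineBytes) % kNumSets; // "a few bits out of the address"
}

int main() {
    // Walking down one column of a 1024x1024 int array: stride = 1024 * 4 = 4096 bytes.
    for (std::uint64_t row = 0; row < 4; ++row) {
        std::uint64_t addr = row * 1024 * 4;
        std::printf("row %llu -> set %llu\n",
                    (unsigned long long)row, (unsigned long long)l1_set(addr));
    }
    // Every address lands in set 0, so the whole column competes for just 8 ways.
}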
A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.
Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".
Back to your Question...
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
There are conceptual problems with that quote.
Main memory is not "mapped to a cache"; see virtual versus real addresses.
The penalty comes when the cache does not have the desired word.
"shrinking the cache" -- The cache is a fixed size, based on the hardware involved.
Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 number of bytes and positioned at some multiple of that in the real address space. A "word" for caching depends on the vintage of the CPU, which level of cache, etc. 4-, 8-, and 16-byte words can probably be found today. Again, the power-of-2 size and positioned-at-a-multiple placement are simple optimizations.
Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.

Minimizing page faults (and TLB faults) while "walking" a large graph

Problem (think of the mark phase of a GC)
I have a graph of “objects” that I need to walk, visiting all objects.
I can store in each object if it has been visited.
All the objects are stored in memory and linked together using normal pointers.
The objects are not all the same size.
Sometimes there is not enough ram in the system to hold all the objects in memory at the same time, and I wish to avoid “page thrashing”.
I also wish to avoid TLB faults
Other times, there is more than enough ram.
I do not mind writing low-level code.
I do not mind different code for windows and linux.
The code must run in “user space” without needing non-standard permissions.
I don't care about the order I visit the nodes in.
I am going to ask more detailed questions about possible solutions, linking back to this question.
Page faults aren't necessarily bad, as long as they're not stalling your progress.
This means that if you have a node Node* p with two candidate successors p->left and p->right, it can be useful to pick the nearer one (in terms of pointer distance from p) and pre-fetch the other (e.g. with PrefetchVirtualMemory).
How efficient this will be cannot be predicted; it greatly depends on your graph topology. But the prefetch is virtually free when you have enough RAM.
Closer to the CPU, there's cache prefetching. Same idea, different storage.
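A sketch of what the OS-level hint can look like on Linux, using madvise(MADV_WILLNEED) (mentioned below as the rough equivalent of PrefetchVirtualMemory); the page-rounding helper is mine, and the page size is queried rather than assumed:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Hint that the page holding 'p' will be needed soon. This is only a hint;
// how much it helps depends on your graph topology.
void prefetch_page(const void* p) {
    const std::uintptr_t page  = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(p) & ~(page - 1);
    madvise(reinterpret_cast<void*>(start), page, MADV_WILLNEED);
}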
Use 2M hugepages for address ranges that are full of "hot" data that the kernel can't usefully swap out any / many 4k chunks of. This will reduce TLB misses, but costs extra physical memory if there are any 4k chunks of a hugepage that aren't hot.
Linux does this transparently for anonymous pages (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), but you can use madvise(MADV_HUGEPAGE) on pages you know are worth it, to encourage the kernel to defrag physical memory even if that's not the default in /sys/kernel/mm/transparent_hugepage/defrag. (You can look at /proc/PID/smaps to see how many transparent hugepages are in use for any given mapping.)
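A minimal sketch of the madvise(MADV_HUGEPAGE) opt-in described above (the mapping size and helper name are illustrative):

#include <sys/mman.h>
#include <cstddef>

// Map an anonymous region and ask the kernel to back it with 2M transparent
// hugepages. MADV_HUGEPAGE is only a hint; /proc/PID/smaps shows whether it took.
void* map_hot_region(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}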
Based on what you posted in your answer: An ordered set of nodesToVisit would give you the most locality, but might be too expensive to maintain. Multiple accesses within the same 64-byte cache line are much cheaper than coming back to it later after it's been evicted from L3 cache and has to come from DRAM again.
If you have lots of addresses to visit in your Set, doing one pass of a radix-sort into 2M buckets would give you locality within one hugepage. 2M is also smaller than L3 cache size, so you'll probably get some cache hits when visiting multiple objects in the same cache line, even if you don't hit them back to back.
Depending on how big your Set is, throwing around that many pointers even to partial-sort them might not be worth the memory traffic that takes. But there's probably some sweet spot of taking a window of data and at least partially sorting it. Using the pointers before they are evicted from cache is nice.
SW prefetch can trigger a page-walk to avoid a TLB miss, so you could _mm_prefetch(_MM_HINT_T2) one address from the next 2M bucket before starting on the current bucket. See also Prefetching Examples?. I haven't tested this, but it might work well. It won't help with page faults: prefetch from an unmapped page won't cause a page fault, and you don't want to trigger an actual PF until you're ready to touch the page.
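As a rough illustration of that suggestion (the bucket layout and names are hypothetical, and whether it pays off would need measuring):

#include <xmmintrin.h>
#include <cstddef>
#include <vector>

// Before working through bucket i, touch one address from bucket i+1 so its
// TLB/page walk and line fill can start in the background.
void visit_buckets(const std::vector<std::vector<char*>>& buckets) {
    for (std::size_t i = 0; i < buckets.size(); ++i) {
        if (i + 1 < buckets.size() && !buckets[i + 1].empty())
            _mm_prefetch(buckets[i + 1].front(), _MM_HINT_T2);
        for (char* obj : buckets[i]) {
            // ... visit the object at 'obj' ...
            (void)obj;
        }
    }
}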
MSalter's suggestion to ask the OS to prefetch and wire the next page is interesting (I think madvise(MADV_WILLNEED) is the Linux equivalent), but a system call will be slow for no benefit if the page was already mapped+wired into the HW page table. There's no x86 asm instruction that just asks if a page is mapped without faulting if it isn't, so I can't think of a way to efficiently choose not to call it. And BTW, I think Linux breaks up transparent hugepages into 4k regular pages for paging in/out. But don't write a big loop that just does _mm_prefetch() or madvise on all the 4k pages in a 2M block; that probably sucks. The prefetcht2 part would probably just result in excess prefetch requests being dropped.
Use perf counters to look at cache hit/miss rates. On Intel CPUs, the mem_load_retired.l1_miss and/or .l2_miss event should show you whether you're getting cache hits on accessing the Set itself, as well as on accessing dereferencing those pointers. Those counters are precise events, so they should map accurately to asm load instructions. (e.g. perf record -e mem_load_retired.l2_miss ./my_program / perf report on Linux).
We remove one item at a time from nodesToVisit
I don't know much about GC design, but can't you use a sequence number or tagged-pointer or something to avoid modifying the Set data structure itself every GC pass? If your minimum object alignment is 4 bytes, you have 2 bits to play with at the bottom of every pointer. ANDing them off before dereferencing is very cheap.
x86-64 with full 64-bit pointers currently requires the top 16 to be the sign-extension of the low 48. So you could use bits there (16 bits, or maybe just the top byte) if you re-canonicalize pointers. (redo sign extension, or just zero the high 16 bits if you want to assume user-space pointers; Linux uses a high-half kernel VM layout so user-space addresses are always in the low half of virtual address space. IDK what Windows does.)
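A sketch of the low-bit tagging idea, assuming at least 4-byte object alignment as stated above (the Node type and helper names are hypothetical):

#include <cstdint>

struct Node; // hypothetical graph node

// With 4-byte (or larger) alignment, the low 2 bits of every Node* are zero,
// so one of them can carry the "visited" mark.
constexpr std::uintptr_t kMarkBit = 0x1;

Node* with_mark(Node* p)  { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) | kMarkBit); }
bool  is_marked(Node* p)  { return (reinterpret_cast<std::uintptr_t>(p) & kMarkBit) != 0; }
Node* strip_mark(Node* p) { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) & ~kMarkBit); } // AND it off before dereferencing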
On x86-64, you might consider using the x32 ABI (32-bit pointers in long mode) if 4GiB of address space is enough, especially if you're hitting physical memory limits and swapping. Smaller pointers mean smaller data structures, thus half the cache footprint.
Some Linux systems are built without kernel support for x32, though, only classic x86-64 and usually 32-bit mode. But if it works on your systems, consider gcc -mx32.
These are my first thoughts about a possible solution, they are clearly not optimal. I will delete this answer if someone posts a better answer.
The basic method:
Assume we have a Set<NodePointer> nodesToVisit that contains all nodes we have not yet visited.
We remove one item at a time from nodesToVisit,
and if it has not been visited before we add all “pointers to other nodes” to nodesToVisit.
Improvements:
But we can clearly do better, by ordering nodesToVisit based on address, so that we are more likely to visit nodes that are contained in pages we have recently accessed. This could be as simple as having a second Set<NodePointer> nodesToVisitLater, and putting any node that has an address a long way from the current node into it.
Or we could skip over any node that are contained in pages that are not resident in memory, visiting these nodes after we have visited all nodes that are currently in memory.
(The"set" could just be a stack, as visiting a node more than once is a "no-opp")
https://patents.google.com/patent/US7653797B1/en seems to be related, but I have not read it yet.
https://hosking.github.io/links/Cher+2004ASPLOS.pdf
https://people.cs.umass.edu/~emery/pubs/cramm.pdf
https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf
https://people.cs.umass.edu/~emery/pubs/04-16.pdf

How can I better understand the impact of modern caching on algorithm performance?

I'm reading the following paper: http://www-db.in.tum.de/~leis/papers/ART.pdf and in it, they say in the abstract:
Main memory capacities have grown up to a point where most databases
fit into RAM. For main-memory database systems, index structure
performance is a critical bottleneck. Traditional in-memory data
structures like balanced binary search trees are not efficient on
modern hardware, because they do not optimally utilize on-CPU caches.
Hash tables, also often used for main-memory indexes, are fast but
only support point queries.
How can I better understand this utilization of on-CPU caches and how it impacts the performance of particular data structures/algorithms?
Just somewhere to get started would be great because this sort of analysis is really opaque to me and I don't know where to go to start understanding.
This is going to be a really basic answer, as it would otherwise be extremely broad. I'm also not an expert on the subject (picking up bits and pieces to help understand how to optimize my hotspots better). But it might help you get started investigating this subject.
The topic reminds me of my university days, when computer architecture courses only taught about registers, DRAM, and disk, while glossing over the CPU cache in between. The CPU cache is one of the most dominant factors in performance these days.
The memory of the computer is divided into a hierarchy ranging from the absolute biggest but slowest (disk) to absolute smallest but fastest (registers).
Below disk is DRAM which is still pretty slow. And above registers is the CPU cache which is pretty damned fast (especially the smallest L1 cache).
Accessing One Node
Now let's say you request to access memory in some form from some data structure, say a linked structure like a tree or linked list and we're just accessing one node.
Note, I'm inverting the view of memory access for simplicity. Typically it begins with an instruction to load something into a register with the process working backwards and forwards, rather than merely forwards.
Virtual to Physical (DRAM)
In this case, unless the memory is already mapped to physical memory, the operating system has to map a page from virtual memory to a physical address in DRAM (this is freaking slow, especially in the worst-case scenario where the page fault involves a disk access). This is often done in pretty hefty chunks (the machine grabs memory by the handful), like aligned 4-kilobyte chunks. So we end up grabbing a big old 4-kilobyte aligned chunk of memory just for this one node.
DRAM to CPU Cache
Now that this 4-kilobyte page is physically mapped, we still want to do something with the node (most instructions have to operate at the register level) so the computer moves it down through the CPU cache hierarchy (this is pretty slow). Typically all levels of CPU cache have the same cache-line size, like 64-byte cache lines on Intel.
To move the memory from DRAM into these CPU caches, we have to grab a chunk of cache-line-sized-and-aligned memory from DRAM and move it into the CPU cache. We might also have to evict some data already in various levels of the CPU cache hierarchy on the way, like the least recently used memory. So now we're grabbing a 64-byte aligned handful of memory for this node.
Maybe at this point the cache line might look something like this: [??? | ??? | 42 | ??? | ...]. Let's say the relevant node data is 42, while the stuff in ??? is irrelevant memory surrounding it that's not part of our linked data structure.
CPU Cache to Register
Now we move the memory from CPU cache into a register (this occurs very quickly). And here we're still grabbing memory in sort of a handful, but a pretty small one. For example, we might grab a 64-bit aligned chunk of memory and move it into a general-purpose register. So we grab the memory around "42" here and move it into a register.
Finally we do some operations on the register and store the results, and the results often kind of work their way back up the memory hierarchy.
Accessing One Other Node
When we access the next node in the linked structure, we potentially end up having to do all of this over again, just to read one little node's data. The contents of the cache line might then look like this (with 22 being the node data of interest): [??? | 22 | ??? | ...].
We can see potentially how much wasted effort the hardware and operating system are applying, moving big, aligned chunks of data from slower memory to faster memory only in order to access one little teeny bit of it prior to eviction.
And that's why little objects all allocated separately, as in the case of linked nodes or languages which can't represent user-defined types contiguously, aren't very cache or page-friendly. They tend to invoke a lot of page faults and cache misses as we traverse them, accessing their data. That is, unless they have help from a memory allocator which allocates these nodes in a more contiguous fashion (in which case the data or two or more nodes might be right next to each other and accessed together).
Contiguity and Spatial Locality
The most cache-friendly data structures tend to be based on contiguous arrays (it doesn't have to be one gigantic array, but perhaps arrays linked together, e.g., as is the case with an unrolled list). When we iterate through an array and access the first element, we might have to go through the motions described above, yet once the memory is moved into a cache line we might get a whole run of elements at once, say [17 | 18 | 19 | 20 | ...].
Now we can iterate through the array and access all the elements while it's in the second-fastest form of memory on the machine, the L1 cache, simply moving data from L1 cache to register after the initial compulsory cache miss/page fault. If we start at 17, we have the initial compulsory cache miss but all the subsequent elements in this cache line can then be accessed without repeating the motions above. This is extremely fast, and the computer can blaze through such data.
So that was what was meant by this part:
Traditional in-memory data structures like balanced binary search
trees are not efficient on modern hardware, because they do not
optimally utilize on-CPU caches.
Note that it is possible to make linked structures like trees and linked lists substantially more cache-friendly than they would naturally be using a custom memory allocator, but they lack this inherent cache-friendliness at the basic data structure level.
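For instance, a toy arena that hands out nodes from one growing pool, so consecutively allocated nodes tend to share pages and often cache lines (a sketch under the assumption that nodes never need individual deallocation, not a production allocator):

#include <deque>
#include <utility>

// Nodes created back to back sit in the same contiguous chunks, which makes
// traversals far friendlier than one separate heap allocation per node.
// std::deque never invalidates references on push_back, so returned pointers stay valid.
template <typename NodeT>
class NodeArena {
public:
    template <typename... Args>
    NodeT* create(Args&&... args) {
        storage_.emplace_back(std::forward<Args>(args)...);
        return &storage_.back();
    }
private:
    std::deque<NodeT> storage_;
};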
Hash tables, on the other hand, tend to be contiguous table structures based on arrays. They might use chaining and linked bucket structures, but those are also easier to make cache-efficient with a little teeny bit of help from the custom allocator (far less than the tree due to the simpler, sequential access patterns within a hash bucket).
So anyway, that's a little brief overview on the subject, a bit oversimplified, but hopefully enough to help get started. If you want to understand this subject at a deeper level, keywords would be cache/memory efficiency/optimization and locality of reference.

How does one write code that best utilizes the CPU cache to improve performance?

This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How to make code, cache effective/cache friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache & program cache (instruction cache),
i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective.
Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of that structure etc... to make code cache effective.
Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences related to making cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.
The variety will help to understand it more deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).
Techniques for avoiding memory-fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before the lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelihood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data (see the sketch after this list).
Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
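Here is a small sketch of the struct-layout points above, with made-up field names; the "before" version wastes space on padding and drags cold data into every cache line, the "after" version packs the hot fields together:

#include <cstdint>

// Before: padding holes, and a rarely used field shares lines with the hot ones.
struct ParticleBad {
    bool          alive;          // 1 byte + 7 bytes of padding before the doubles
    double        x, y, z;        // hot
    char          debug_name[32]; // cold, but loaded with every particle
    std::uint16_t flags;          // followed by padding again
};

// After: members sorted by decreasing size, cold data split out.
struct ParticleHot {
    double        x, y, z;
    std::uint16_t flags;
    bool          alive;
};
struct ParticleCold {
    char debug_name[32];
};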
We should also note that there are other ways to hide memory latency than using caches.
Modern CPUs often have one or more hardware prefetchers. They train on misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
By regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory-level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
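A sketch of blocking applied to a simple transpose-like traversal; the tile size is an assumption you would tune so a tile's worth of data fits comfortably in cache:

constexpr int N = 4096;
constexpr int B = 64; // tile size to tune

void transpose_tiled(const float* in, float* out) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            // Work on one BxB tile at a time so the lines it touches are
            // reused while they are still resident in the cache.
            for (int i = ii; i < ii + B; ++i)
                for (int j = jj; j < jj + B; ++j)
                    out[j * N + i] = in[i * N + j];
}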
These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
Pseudocode (written here as C):

for (int i = 0; i < size; ++i)
    for (int j = 0; j < size; ++j)
        do_something(ary[j][i]);
The reason this is cache-inefficient is that modern CPUs load a whole cache line of "nearby" memory addresses from main memory when you access a single address. Because the inner loop steps through j, which selects the row, each iteration of the inner loop jumps a full row ahead in memory, so a new cache line has to be loaded for each [j][i] entry. If this is changed to the equivalent:
for (int i = 0; i < size; ++i)
    for (int j = 0; j < size; ++j)
        do_something(ary[i][j]);
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accommodate this by remembering the most recently used chunks of data. It operates on cache lines, typically 64 or 128 bytes in size, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
The most effective data structure for a cache is an array. Caches work best, if your data structure is laid out sequentially as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order thrashes the caches, because it constantly needs new cache lines to accommodate the randomly accessed memory. On the other hand, an algorithm which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
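A rough sketch of that layout, with the object ID as the array index (field names are made up):

#include <cstddef>
#include <vector>

// Structure-of-arrays physics data: the update loop streams through exactly
// the fields it needs, indexed by object ID rather than chasing pointers.
struct PhysicsData {
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> mass;
};

void integrate(PhysicsData& p, float dt) {
    for (std::size_t id = 0; id < p.posX.size(); ++id) {
        p.posX[id] += p.velX[id] * dt;
        p.posY[id] += p.velY[id] * dt;
        p.posZ[id] += p.velZ[id] * dt;
    }
}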
A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)
I wanted to check the time differences for two iteration orders ( "outter" and "inner"), so I made a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
    for ( int x = 0; x < N; ++x )
        sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version ("x first") was 0.88sec and the faster one, was 0.06sec. That's the power of caching :)
I used gcc -O2 and the loops were still not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by itselves" does not hold here.
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
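A sketch of the align-and-pad fix, assuming 64-byte cache lines:

#include <atomic>

// Each counter gets its own cache line, so one thread's updates no longer
// invalidate the line holding the other thread's counter (no false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    // alignas(64) rounds the struct size up, so the rest of the line is padding
};

PaddedCounter counters[2]; // e.g. one per thread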
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
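One way to cope is to query the line size at run time instead of hard-coding it; a sketch for Linux/glibc and C++17 (availability of both values is an assumption about your platform):

#include <unistd.h> // sysconf (POSIX)
#include <new>      // std::hardware_destructive_interference_size (C++17)
#include <cstdio>

int main() {
    long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); // may return 0 or -1 if unknown
    std::printf("L1 data cache line (sysconf): %ld bytes\n", l1_line);
#ifdef __cpp_lib_hardware_interference_size
    std::printf("hardware_destructive_interference_size: %zu bytes\n",
                std::hardware_destructive_interference_size);
#endif
}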
Asking how to make code cache-effective/cache-friendly, like most of the other questions here, usually amounts to asking how to optimize a program; that's because the cache has such a huge impact on performance that any optimized program is one that is cache-effective/cache-friendly.
I suggest reading about optimization; there are some good answers on this site.
In terms of books, I recommend Computer Systems: A Programmer's Perspective, which has some fine text about the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There has been a lot of answers on general advices like data structure selection, access pattern, etc. Here I would like to add another code design pattern called software pipeline that makes use of active cache management.
The idea is borrowed from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern best applies to procedures that
can be broken down into multiple reasonable sub-steps, S[1], S[2], S[3], ..., whose execution time is roughly comparable to RAM access latency (~60-70 ns), and
take a batch of inputs and perform the aforementioned steps on them to get results.
Let's take a simple case where there is only one sub-procedure.
Normally the code would look like:

def proc(input):
    return sub_step(input)
To have better performance, you might want to pass multiple inputs to the function in a batch so you amortize function call overhead and also increase code cache locality.
def batch_proc(inputs):
    results = []
    for i in inputs:
        # avoids code cache misses, but still suffers data (inputs) misses
        results.append(sub_step(i))
    return results
However, as said earlier, if the execution of the step is roughly the same as RAM access time you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
    results = []
    for i in range(0, len(inputs) - 1):
        prefetch(inputs[i + 1])
        # work on the current item while inputs[i+1] is flying back from RAM
        results.append(sub_step(inputs[i]))
    results.append(sub_step(inputs[-1]))
    return results
The execution flow would look like:
prefetch(1): ask the CPU to prefetch inputs[1] into cache; the prefetch instruction itself takes P cycles and returns, and in the background inputs[1] arrives in cache after R cycles.
works_on(0): cold miss on inputs[0] and work on it, which takes M cycles.
prefetch(2): issue another fetch.
works_on(1): if P + R <= M, then inputs[1] should already be in the cache before this step, thus avoiding a data cache miss.
works_on(2) ...
There could be more steps involved; you can then design a multi-stage pipeline as long as the timing of the steps and the memory access latency match, and you will suffer few code/data cache misses. However, this process needs to be tuned with many experiments to find the right grouping of steps and prefetch distance. Due to the required effort, it sees more adoption in high-performance data/packet stream processing. A good production code example can be found in the DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information could be found:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
Besides aligning your structure and fields, if your structure is heap-allocated you may want to use allocators that support aligned allocations, like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may get random false sharing. Remember that in Windows, the default heap has 16-byte alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC. It takes up a larger size. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.
Work with small chunks of data at a time. That is why a less efficient sorting algorithm can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted in to assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different code even though they are merely traversing an array. In any case, your question is very architecture-specific. So your only way to tightly control cache use is to understand how the hardware works and optimise your code for it.

Information on N-way set associative Cache stides

Several of the resources I've gone to on the internet disagree on how set-associative caching works.
For example, Hardware Secrets seems to believe it works like this:
Then the main RAM memory is divided in
the same number of blocks available in
the memory cache. Keeping the 512 KB
4-way set associative example, the
main RAM would be divided into 2,048
blocks, the same number of blocks
available inside the memory cache.
Each memory block is linked to a set
of lines inside the cache, just like
in the direct mapped cache.
http://www.hardwaresecrets.com/printpage/481/8
They seem to be saying that each cache block (4 cache lines) maps to a particular block of contiguous RAM. They are saying non-contiguous blocks of system memory (RAM) can't map to the same cache block.
This is their picture of how Hardware Secrets thinks it works:
http://www.hardwaresecrets.com/fullimage.php?image=7864
Contrast that with wikipedia's picture of set associative cache
http://upload.wikimedia.org/wikipedia/commons/9/93/Cache%2Cassociative-fill-both.png
Brown disagrees with Hardware Secrets:
Consider what might happen if each
cache line had two sets of fields: two
valid bits, two dirty bits, two tag
fields, and two data fields. One set
of fields could cache data for one
area of main memory, and the other for
another area which happens to map to
the same cache line.
http://www.spsu.edu/cs/faculty/bbrown/web_lectures/cache/
That is, non-contiguous blocks of system memory can map to the same cache block.
How are the relationships between non-contiguous blocks of system memory and cache blocks created? I read somewhere that these relationships are based on cache strides, but I can't find any information on cache strides beyond the fact that they exist.
Who is right?
If striding is actually used, how does striding work, and do I have the correct technical name? How do I find the stride for a particular system? Is it based on the paging system? Can someone point me to a URL that explains N-way set-associative caches in great detail?
also see:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html
When I teach cache memory architecture to my students, I start with a direct-mapped cache. Once that is understood, you can think of N-way set associative caches as parallel blocks of direct-mapped cache. To understand that both figures may be correct, you need to first understand the purpose of set-assoc caches.
They are designed to work around the problem of 'aliasing' in a direct-mapped cache, where multiple memory locations can map to a specific cache entry. This is illustrated in the Wikipedia figure. So, instead of evicting a cache entry, we can use a N-way cache to store the other 'aliased' memory locations.
In effect, the Hardware Secrets diagram would be correct assuming the order of replacement is such that the first chunk of main memory is mapped to way 1, the second chunk to way 2, and so on and so forth. However, it is equally possible to have the first chunk of main memory spread over multiple ways.
Hope this explanation helps!
PS: Contiguous memory locations are only needed for a single cache line, exploiting spatial locality. As for the latter part of your question, I believe that you may be confusing several different concepts.
The replacement policy decides where in the cache a copy of a
particular entry of main memory will go. If the replacement policy is
free to choose any entry in the cache to hold the copy, the cache is
called fully associative. At the other extreme, if each entry in main
memory can go in just one place in the cache, the cache is direct
mapped. Many caches implement a compromise in which each entry in main
memory can go to any one of N places in the cache, and are described
as N-way set associative
