Does msync performance depend on the size of the provided range? - memory-management

I'm making many small random writes to a mmaped file. I want to ensure consistency, so from time to time I use msync, but I don't want to keep track of every single small write that I made. In the current Linux kernel implementation, is there a performance penalty for using msync on the whole file? For example if the file is 100GB, but I only made total of 10MB changes? Is the kernel looping over every page in the range provided for msync to find the dirty pages to flush or are those kept on some sort of linked list/other structure?

TL;DR: no it doesn't, kernel structures that hold the needed information are designed to make the operation efficient regardless of range size.
Pages of mappable objects are kept in a radix tree, however the Linux kernel implementation of radix trees has an additional special feature: entries can be marked with up to 3 different marks, and marked entries can be found and iterated on a lot faster. The actual data structure used is called "XArray", you can find more information about it in this LWN article or in Documentation/core-api/xarray.rst.
Dirty pages have a special mark which can be set (PAGECACHE_TAG_DIRTY) allowing for them to be quickly found when writeback is needed (e.g. msync, fsync, etc). Furthermore, XArrays provide an O(1) mechanism to check whether any entry exists with a given mark, so in the case of pages it can be quickly determined whether a writeback is needed at all even before looking for dirty pages.
In conclusion, you should not incur in a noticeable performance penalty when calling msync on the entire mapping as opposed to only a smaller range of actually modified pages.

Related

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work but I don't know how that can be easily switched to dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This, way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object, however there is one drawback: you can end-up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects. Which effectively allows to merge pages.
(*) BiBOP: Big Bag Of Pages
I would recommend making customized file-system (might contain one file of course), based on FUSE. There are a lot of available solutions for FUSE you can base on - I recommend choosing not related but simplest projects, in order to learn easily.
What algorithm and data-structure to choose, it highly deepens on your needs. It can be : map, list or file split into chunks with on-the-fly compression/decompression.
Data structures proposed by you are good ideas. As you clearly see there is a trade-off: fragmentation vs compaction.
On one side - best compaction, highest fragmentation - splay and many other kinds of trees.
On another side - lowest fragmentation, worst compaction - linked list.
In between there are B-Trees and others.
As you I understand, you stated as priority: space-saving - while taking care about performance.
I would recommend you mixed data-structure in order to achieve all requirements.
a kind of list of contiguous blocks of data
a kind of tree for current "add/remove" operation
when data are required on demand, allocate from tree. When deleted, keep track what's "deleted" using tree as well.
mixing -> during each operation (or on idle moments) do "step by step" de-fragmentation, and apply changes kept in tree to contiguous blocks, while moving them slowly.
This solution gives you fast response on demand, while "optimising" stuff while it's is used, (For example "each read of 10MB of data -> defragmantation of 1MB) or in idle moments.
The most simple solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.

How is fseek() implemented in the filesystem?

This is not a pure programming question, however it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of the file. Supposing I have a file with 1MB data and then I insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete lets say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.
(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different if the target position is inside or outside the current buffer. Also, the system services that the I/O libary uses may vary a lot. I.e. on some systems the library uses extensively the file memory mapping if possible.
2) As you said, different filesystems may behave in a very different way. In particular, I would expect that a transactional filesystem must do something very smart and perhaps expensive to be prepared to a possible rollback of an aborted write operation in the middle of a file.
3) Modern OS'es have very aggressive caching algorithms. An "fseeked" file is likely to be already present in cache, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes become important.
Any comments?
fseek(...) is a library call, not an OS system call. It is the run-time library that takes care of the actual overhead involved in making a system call to the OS, technically speaking, fseek is indirectly making a call to the system but really it is not (this brings up a clear distinction between the differences between a library call and a system call). fseek(...) is a standard input-output function regardless of the underlying system...however...and this is a big however...
The OS will more than likely to have cached the file in its kernel memory, that is, the direct offset to the location on the disk on where the 1's and 0's are stored, it is through the OS's kernel layers, more than likely, a top-most layer within the kernel that would have the snapshot of what the file is composed of, i.e. data irrespectively of what it contains (it does not care either way, as long as the 'pointers' to the disk structure for that offset to the lcoation on the disk is valid!)...
When fseek(..) occurs, there would be a lot of over-head, indirectly, the kernel delegated the task of reading from the disk, depending on how fragmented the file is, it could be theoretically, "all over the place", that could be a significant over-head in terms of having to, from a user-land perspective, i.e. the C code doing an fseek(...), it could be scattering itself all over the place to gather the data into a "one contiguous view of the data" and henceforth, inserting into the middle of a file, (remember at this stage, the kernel would have to adjust the location/offsets into the actual disk platter for the data) would be deemed slower than appending to the end of the file.
The reason is quite simple, the kernel "knows" what was the last offset was, and simply wipe the EOF marker and insert more data, behind the scenes, the kernel, is having to allocate another block of memory for the disk-buffer with the adjusted offset to the location on the disk following an EOF marker, once the appending of data is completed.
Let us assume the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between a insert and an append. In both cases the files node and offset table must be read, the relevant disk sector mapped into memory, the data updated and at some later point the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file since this will reduce the number of load/store combos.
As a previous answers says you may be able to speed up both operations if you deal with data writes that exact multiples of the FS block size, in this case you could skip the load stage and just insert the new blocks into the files inode datastrucure. This would not be practical, as you would need low level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris, is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads it's a good idea to do it unbuffered (setvbuf with NULL buffer) or even use direct syscalls (lseek+read or even better pread which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OS.
You can insert data to the middle of file efficiently only if data size is a multiple of FS sector but OSes doesn't provide such functions so you have to use low-level interface to the FS driver.
Inserting data in the middle of the file is less efficient than appending to the end because when inserting you would have to move the data after the insertion point to make room for the data being inserted. Moving these data would involve reading them from disk, writing the data to be inserted and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.

Why is LRU better than FIFO?

Why is Least Recently Used better than FIFO in relation to page files?
If you mean in terms of offloading memory pages to disk - if your process is frequently accessing a page, you really don't want it to be paged to disk, even if it was the very first one you accessed. On the other hand, if you haven't accessed a memory page for several days, it's unlikely that you'll do so in the near future.
If that's not what you mean, please edit your question to give more details.
There is no single cache algorithm which will always do well because that requires perfect knowledge of the future. (And if you know where to get that...) The dominance of LRU in VM cache design is the result of a long history of measuring system behavior. Given real workloads, LRU works pretty well a very large fraction of the time. However, it isn't very hard to construct a reference string for which FIFO would have superior performance over LRU.
Consider a linear sweep through a large address space much larger than the available pageable real memory. LRU is based on the assumption that "what you've touched recently you're likely to touch again", but the linear sweep completely violates that assumption. This is why some operating systems allow programs to advise the kernel about their reference behavior - one example is "mark and sweep" garbage collection typified by classic LISP interpreters. (And a major driver for work on more modern GCs like "generational".)
Another example is the symbol table in a certain antique macro processor (STAGE2). The binary tree is searched from the root for every symbol, and the string evaluation is being done on a stack. It turned out that reducing the available page frames by "wiring down" the root page of the symbol tree and the bottom page of the stack made a huge improvement in the page fault rate. The cache was tiny, and it churned violently, always pushing out the two most frequently referenced pages because the cache was smaller than the inter-reference distance to those pages. So a small cache worked better, but ONLY because those two page frames stolen from the cache were used wisely.
The net of all this is that LRU is the standard answer because it's usually pretty good for real workloads on systems that aren't hideously overloaded (VM many times the real memory available), and that is supported by years of careful measurements. However,
you can certainly find cases where alternative behavior will be superior. This is why measuring real systems is important.
Treat the RAM as a cache. In order to be an effective cache, it needs to keep the items most likely to be requested in memory.
LRU keeps the things that were most recently used in memory. FIFO keeps the things that were most recently added. LRU is, in general, more efficient, because there are generally memory items that are added once and never used again, and there are items that are added and used frequently. LRU is much more likely to keep the frequently-used items in memory.
Depending on access patterns, FIFO can sometimes beat LRU. An Adaptive Replacement Cache is hybrid that adapts its strategy based on actual usage patterns.
According to temporal locality of reference, memory that has been accessed recently is more likely to be accessed again soon.

How does one write code that best utilizes the CPU cache to improve performance?

This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How to make code, cache effective/cache friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache & program cache (instruction cache),
i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective.
Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of that structure etc... to make code cache effective.
Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences related to making cache efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux,S ymbian, ...), etc..
The variety will help to better to understand it deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transfered (preserving memory bandwidth).
Techniques for avoiding suffering from memory fetch latency is typically the first thing to consider, and sometimes helps a long way. The limited memory bandwidth is also a limiting factor, particularly for multicores and multithreaded applications where many threads wants to use the memory bus. A different set of techniques help addressing the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to a cache. When we have looked at various standard benchmarks, we have seen that a surprising large fraction of those fail to use 100% of the fetched cache lines before the cache lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelyhood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
avoid algorithms and datastructures that exhibit irregular access patterns, and favor linear datastructures.
We should also note that there are other ways to hide memory latency than using caches.
Modern CPU:s often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hw prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher is usually doing a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
Regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustain one latency hit (Memory level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
These things are what really pays off in the multicore world, where you typically wont see much of throughput improvements after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
pseudocode
for (i = 0 to size)
for (j = 0 to size)
do something with ary[j][i]
The reason this is cache inefficient is because modern CPUs will load the cache line with "near" memory addresses from main memory when you access a single memory address. We are iterating through the "j" (outer) rows in the array in the inner loop, so for each trip through the inner loop, the cache line will cause to be flushed and loaded with a line of addresses that are near to the [j][i] entry. If this is changed to the equivalent:
for (i = 0 to size)
for (j = 0 to size)
do something with ary[i][j]
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accomodate this by remembering the most recently used chunks of data. It operates with cache lines, typically sized 128 byte or so, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
The most effective data structure for a cache is an array. Caches work best, if your data structure is laid out sequentially as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order trashes the caches because it always needs new cache lines to accomodate the randomly accessed memory. On the other hand an algorithm, which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)
I wanted to check the time differences for two iteration orders ( "outter" and "inner"), so I made a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
for ( int x = 0; x < N; ++x )
sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version ("x first") was 0.88sec and the faster one, was 0.06sec. That's the power of caching :)
I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by itselves" does not hold
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
To ask how to make a code, cache effective-cache friendly and most of the other questions , is usually to ask how to Optimize a program, that's because the cache has such a huge impact on performances that any optimized program is one that is cache effective-cache friendly.
I suggest reading about Optimization, there are some good answers on this site.
In terms of books, I recommend on Computer Systems: A Programmer's Perspective which has some fine text about the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There has been a lot of answers on general advices like data structure selection, access pattern, etc. Here I would like to add another code design pattern called software pipeline that makes use of active cache management.
The idea is borrow from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern best applies to procedures that
could be broken down to reasonable multiple sub-steps, S[1], S[2], S[3], ... whose execution time is roughly comparable with RAM access time (~60-70ns).
takes a batch of input and do aforementioned multiple steps on them to get result.
Let's take a simple case where there is only one sub-procedure.
Normally the code would like:
def proc(input):
return sub-step(input))
To have better performance, you might want to pass multiple inputs to the function in a batch so you amortize function call overhead and also increases code cache locality.
def batch_proc(inputs):
results = []
for i in inputs:
// avoids code cache miss, but still suffer data(inputs) miss
results.append(sub-step(i))
return res
However, as said earlier, if the execution of the step is roughly the same as RAM access time you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
for i in range(0, len(inputs)-1):
prefetch(inputs[i+1])
# work on current item while [i+1] is flying back from RAM
results.append(sub-step(inputs[i-1]))
results.append(sub-step(inputs[-1]))
The execution flow would look like:
prefetch(1) ask CPU to prefetch input[1] into cache, where prefetch instruction takes P cycles itself and return, and in the background input[1] would arrive in cache after R cycles.
works_on(0) cold miss on 0 and works on it, which takes M
prefetch(2) issue another fetch
works_on(1) if P + R <= M, then inputs[1] should be in the cache already before this step, thus avoid a data cache miss
works_on(2) ...
There could be more steps involved, then you can design a multi-stage pipeline as long as the timing of the steps and memory access latency matches, you would suffer little code/data cache miss. However, this process needs to be tuned with many experiments to find out right grouping of steps and prefetch time. Due to its required effort, it sees more adoption in high performance data/packet stream processing. A good production code example could be found in DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information could be found:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
Besides aligning your structure and fields, if your structure if heap allocated you may want to use allocators that support aligned allocations; like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may have random false sharing; remember that in Windows, the default heap has a 16 bytes alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC. It takes up a larger size. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.
Work with small chunks of data at a time. That is why a less efficient sorting algorithms can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted in to assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different codes even though they are merely parsing through an array. In any case, your question is very architecture specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all it's data can be flushed to disk at any time - user switches task, screen saver kicks in, battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2Gb - memory used by application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency.
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
The most simple way to solve this is to use an OS which solves that for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects in RAM and it will try to keep as many of them in the cache as possible reducing the load time to 0. The recent server versions of Windows might work, too (some of them didn't for me, that's why I'm mentioning this).
If this is a no go, try this algorithm:
Create a very big file on the harddisk. It is very important that you write this in one go so the OS will allocate a continuous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of of each chunk). Use an empty harddisk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder over night when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: Place the most-accessed items at the center of your disk (or unfragmented file), not at the start of end. That way, seeking to lesser-used data will be closer by average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.

Resources