Suppose the following hypothetical task:
I am given a single integer A (say, 32-bit) and a large array of integers B (same type). The size of the integer array is fixed at runtime (it doesn't grow mid-run) but is otherwise arbitrary, except that it can always fit inside either RAM or VRAM (whichever is smaller). For the sake of this scenario, the integer array can sit in either RAM or VRAM; ignore any time cost in transferring this initial data set at start-up.
The task is to compare A against each B and to return true only if the test is true against ALL B's, returning false otherwise. For the sake of this scenario, let it be the greater-than comparison (although I'd be interested if your answer differs for slightly more complex comparisons).
A naïve parallel implementation could involve slicing up the set B and distributing the comparison workload across multiple cores. Each core's workload would then be entirely independent, save for when a failed comparison interrupts all the others, because the result is immediately false. Interrupts play a role in this implementation, although I'd imagine an ever decreasing one, probabilistically, as the array of integers gets larger.
My question is three-fold:
Would such a scenario be suitable for parallel processing on a GPU? If so, under what circumstances? Or is this a misleading case where the direct CPU implementation is actually the fastest?
Can you suggest an improved parallel algorithm over the naïve one?
Can you suggest any reading to gain intuition on deciding such problems?
If I understand your question correctly, what you are trying to perform is a reduction. The operation in question is equivalent to a MATLAB/NumPy all(A[:] == B). To answer the three parts:
Yes. Reductions on GPUs/multicore CPUs can be faster than their sequential counterpart. See the presentation on GPU reductions here.
The presentation describes a hierarchical approach to reduction. A more modern approach would be to use atomic operations on shared memory and global memory, as well as warp aggregation. However, if you do not wish to deal with the intricate details of GPU implementations, you can use a highly optimized library such as CUB. (A rough CPU-side sketch of the reduction idea follows below.)
See 1 and 2.
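As a rough illustration of the reduction idea (a CPU-threads sketch only, not a tuned GPU kernel; the thread count, chunking, and the all_greater name are assumptions for illustration):

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Returns true only if a > b for every b in B.
// Each thread reduces its own slice of B; a shared flag lets the other
// threads stop early once any comparison has failed.
bool all_greater(int32_t a, const std::vector<int32_t>& B, unsigned n_threads = 4)
{
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    const std::size_t chunk = (B.size() + n_threads - 1) / n_threads;

    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end   = std::min(B.size(), begin + chunk);
            for (std::size_t i = begin; i < end && ok.load(std::memory_order_relaxed); ++i)
                if (!(a > B[i]))
                    ok.store(false, std::memory_order_relaxed);  // signal everyone else
        });
    }
    for (auto& w : workers) w.join();
    return ok.load();
}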
Good luck! Hope this helps.
I think this is a situation where you'll derive minimal benefit from the use of a GPU. I also think this is a situation where it'll be difficult to get good returns on any form of parallelism.
Comments on the speed of memory versus CPUs
Why do I believe this? Behold: the performance gap (in terrifyingly unclear units).
The point here is that CPUs have gotten very fast. And, with SIMD becoming a thing, they are poised to become even faster.
In the meantime, memory speeds are improving much more slowly. Not shown on the chart are memory buses, which ferry data to and from the CPU. Those are also getting faster, but at a slow rate.
Since RAM and hard drives are slow, CPUs try to store data in "little RAMs" known as the L1, L2, and L3 caches. These caches are super-fast, but super-small. However, if you can design an algorithm to repeatedly use the same memory, these caches can speed things up by an order of magnitude. For instance, this site discusses optimizing matrix multiplication for cache reuse. The speed-ups are dramatic:
The speed of the naive implementation (3Loop) drops precipitously for anything above about a 350x350 matrix. Why is this? Because double-precision numbers (8 bytes each) are being used, this is the point at which the 1 MB L2 cache on the test machine gets filled. All the speed gains you see in the other implementations come from strategically reusing memory so this cache doesn't empty as quickly.
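To give a flavour of what that kind of cache blocking looks like (a minimal sketch, not the code from the linked site; the 64-element block size is an arbitrary assumption you would tune per machine):

#include <algorithm>

// C = A * B for N x N row-major matrices, computed in BS x BS tiles so the
// tiles of A, B and C being worked on stay resident in cache while they are
// reused. Assumes C is zero-initialized by the caller.
void matmul_blocked(const double* A, const double* B, double* C, int N, int BS = 64)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < std::min(ii + BS, N); ++i)
                    for (int k = kk; k < std::min(kk + BS, N); ++k) {
                        const double a = A[i * N + k];        // reused across the whole j loop
                        for (int j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}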
Caching in your algorithm
Your algorithm, by definition, does not reuse memory. In fact, it has the lowest possible rate of memory reuse. That means you get no benefit from the L1, L2, and L3 caches. It's as though you've plugged your CPU directly into the RAM.
How do you get data from RAM?
Here's a simplified diagram of a CPU:
Note that each core has its own dedicated L1 cache. Core pairs share L2 caches. RAM is shared between everyone and accessed via a bus.
This means that if two cores want to get something from RAM at the same time, only one of them is going to be successful. The other is going to be sitting there doing nothing. The more cores you have trying to get stuff from RAM, the worse this is.
For most code, the problem's not too bad, since RAM is accessed infrequently. However, for your code, the performance gap I talked about earlier, coupled with your algorithm's un-cacheable design, means that most of your code's time is spent getting stuff from RAM. That means the cores are almost always in conflict with each other for limited memory bandwidth.
What about using a GPU?
A GPU doesn't really fix things: most of your time will still be spent pulling stuff from RAM. Except rather than having one slow bus (from the CPU to RAM), you have two (the other being the bus from the CPU to the GPU).
Whether you get a speed up is dependent on the relative speed of the CPU, the GPU-CPU bus, and the GPU. I suspect you won't get much of a speed up, though. GPUs are good for SIMD-type operations, or maps. The operation you describe is a reduction or fold: an inherently non-parallel operation. Since your mapped function (equality) is extremely simple, the GPU will spend most of its time on the reduction operation.
tl;dr
This is a memory-bound operation: more cores and GPUs are not going to fix that.
ignore any time cost in transferring this initial data set at start-up
If there are only a few false conditions in millions or billions of elements, you can try an OpenCL example:
// A=5 and B=arr; result is a separate one-int buffer initialized to 0 on the host
int id=get_global_id(0);
if(arr[id]!=5)
{
    atomic_add(result,1);  // count this failed comparison
}
is about as fast as it gets. result[0] must still be zero after the kernel runs if all conditions are "true".
If you are not sure whether there are only a few falses or millions (which makes atomic functions slow), you can add a single-pass preprocessing step to decrease the number of falses:
int id=get_global_id(0);
// load arr[id*128] to arr[id*128+127] into local/private memory
// check whether a single false exists
// if yes, set all cells to a passing ("true") value except one
// write the results back to a temporary arr2 to be used by the kernel above
This copies the whole array into another, but if you can ignore the cost of transferring the initial data from the host to the device, this copy should be negligible too. On top of that, the two kernels themselves shouldn't add more than about 1 ms of overhead (not counting the memory reads and writes).
If the data fits in cache, the second kernel (the one with the atomic function) will access it there instead of in global memory.
If transfer times become a concern, you can hide their latency with pipelined upload/compute/download operations, provided the work can be split into independent chunks of the array.
In implementing most algorithms (sort, search, graph traversal, etc.), there is frequently a trade-off that can be made in reducing memory accesses at the cost of additional ordinary operations.
Knuth has a useful method for comparing the complexity of various algorithm implementations by abstracting it from particular processors and only distinguishing between ordinary operations (oops) and memory operations (mems).
In compiled programs, one typically lets the compiler organise the low level operations, and hopes that the operating system will handle the question of whether data is held in cache memory (faster) or in virtual memory (slower). Furthermore, the exact number / cost of instructions is encapsulated by the compiler.
With Forth, there is no longer such encapsulation, and one is much closer to the machine, albeit perhaps to a stack machine running on top of a register processor.
Ignoring the effect of an operating system (so no memory stalls, etc.), and assuming for the moment a simple processor,
(1) Can anyone advise on how the cost of the ordinary stack operations in Forth (e.g. dup, rot, over, swap, etc.) compares with the cost of Forth's memory access fetch (@) or store (!)?
(2) Is there a rule of thumb I can use to decide how many ordinary operations to trade-off against saving a memory access?
What I'm looking for is something like 'memory access costs as much as 50 ordinary ops, or 500 ordinary ops, or 5 ordinary ops' Ballpark is absolutely fine.
I'm trying to get a sense of the relative expense of fetch and store vs. rot, swap, dup, drop, over, correct to an order of magnitude.
This article, How much time does it take to fetch one word from memory?, talks about main-memory stall times, with some rule-of-thumb numbers; basically, you can execute lots of instructions in the time of one stall on main memory. As others have said, the numbers vary a lot between systems.
Main memory stalls are a big area of interest, especially as CPUs gain more cores but typically not much more memory bandwidth. There is also some research going on into compressing data in main memory, so that the CPU can take advantage of 'spare' cycles and tightly packed cache lines: http://oai.cwi.nl/oai/asset/15564/15564B.pdf
For those who are really interested in the details, most CPU manufacturers publish in-depth guides on memory optimisation and the like, mostly aimed at high-end and compiler writers, but readable by all 2GL and 3GL programmers.
Ps. Go Forth.
A comparison between memory fetches and register operations is okay for assembler programs, as it is for the output of C compilers, which is in fact an assembler program.
In Forth this question hardly makes sense. In the first place, Forth is an interpreter, and in using Forth one forgoes the ultimate in speed. Of course one could add an optimiser on top of Forth, but then the question makes even less sense, because the output of a C optimiser and a Forth optimiser converge to -- you guessed it -- an optimal solution.
Let's look at an elementary operation in Forth like AND.
This is implemented as
CODE AND
    POP AX
    POP BX
    AND AX, BX
    PUSH AX
    NEXT
So we see already three memory operations for something that looks like an elementary calculation operation. It appears the Knuth metric is not applicable, and Forth seems to be losing big time. That is, however, not true: those memory operations all hit the L1 cache of a typical processor. That is about as efficient as local variables in small C functions.
We can compare stack operations with memory operations using VARIABLEs and the stack. The answer is simple: a VARIABLE risks a memory stall, while a stack operation will almost certainly be an L1 cache hit. This is the single most important point of consideration. However, the question explicitly asks not to consider it!
So there.
The llvm documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine the typical data structure: an array of pointers to objects, slowly being built up while, say, reading a bunch of strings from a file and turning them into field values of objects. The allocated objects end up scatter-shot across the address space, with the long-lived objects pointed to by the array separated by short-lived worker objects, like strings. Iterating that array later is going to be pretty slow.
Until the garbage collector runs and rebuilds the data structure. Putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
Very big win; it is the one feature that made garbage collection competitive with explicit memory management, the explicit kind having the problem of not supporting moving objects because doing so would invalidate pointers.
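A rough sketch of the contrast (the Record type is hypothetical, and in a collected runtime the compaction happens for you rather than by hand):

#include <memory>
#include <string>
#include <vector>

struct Record { long key; std::string payload; };

// Scattered layout: each Record lives wherever the allocator happened to put
// it, so iterating chases a pointer into a different part of the heap each time.
long sum_scattered(const std::vector<std::unique_ptr<Record>>& recs)
{
    long total = 0;
    for (const auto& r : recs) total += r->key;   // likely a cache miss per element
    return total;
}

// Compacted layout: the same Records stored contiguously, roughly the way a
// moving collector leaves them after reorganizing the object graph.
long sum_compacted(const std::vector<Record>& recs)
{
    long total = 0;
    for (const auto& r : recs) total += r.key;    // sequential, prefetch-friendly
    return total;
}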
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
I doubt locality helps performance at all - admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well); over time, the small objects that remain will be compacted into a closely related area of the heap, and it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
The deferred de-allocation of memory is a short-term performance benefit, since objects do not need to be de-allocated immediately. However, when the garbage collector does kick in, this benefit disappears. Usually, though, the collection occurs when nothing else is happening in the system (theoretically), so the cost is effectively nullified.
Compaction of the heap also helps allocation: all allocations can come from the beginning of the heap, and the memory manager doesn't have to walk the heap looking for the next free block of the right size. However, traditional systems can gain the same amount of speed by using multiple fixed-block heaps (meaning you always allocate from a heap for the size of block you want, and you always allocate a fixed block, so walking the heap is just finding the first free block, and even that can be removed by using a bitmap).
So all in all, there isn't much of a benefit at all - except in benchmarks, of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when system memory is getting full because the user has done something like load a new page that requires a lot of memory allocations... which in turn requires a collection.
It also has a tendency to use a lot of memory - 'memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look at the StringBuilder class for evidence that this is a well-known problem. Strings may be 'solved' using it, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM I/O - all that memory has to be brought into the CPU caches to be used, so the more memory you use, the more I/O your CPU's memory subsystem has to do, and that can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle finalised objects too. This isn't quite as bad as it used to be, but it can still halt your program while the finalisers are run.
Old Java GCs were dreadful for performance; a lot of research has made them significantly better, but they are still not perfect.
EDIT:
One more thing about localisation: imagine creating an array and adding a few items, then doing a load of allocations, then adding another item to the array. With a GC system the added array element will not be localised, even after a compaction, since each object referenced by the array is stored as an individual item on the heap. This is why I think the localisation issue is not as big a deal as it's made out to be. Now compare that to an array that is allocated with a buffer, where objects are allocated within the buffer space. That may require a realloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty which object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core: if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count has to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
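A minimal sketch of where that coordination cost shows up (the RefCounted type is made up; the atomic read-modify-writes are the operations the cores have to serialize):

#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};
    // ... payload ...
};

void retain(RefCounted* p)
{
    // On a multi-core machine this is an atomic increment that the
    // cache-coherence protocol must serialize across cores.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

void release(RefCounted* p)
{
    // The decrement is atomic too; only the thread that drops the
    // count to zero destroys the object.
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}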
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.
This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.
How do you make code cache-effective/cache-friendly (more cache hits, as few cache misses as possible)? From both perspectives: the data cache and the program (instruction) cache.
i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective.
Are there any particular data structures one should use or avoid, or a particular way of accessing the members of a structure, etc., to make code cache-effective?
Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?
I am looking forward to hearing individual experiences of writing cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.
The variety will help in understanding the topic more deeply.
The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and, as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).
Techniques for avoiding memory-fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.
Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before the lines are evicted.
Improving cache line utilization helps in three respects:
It tends to fit more useful data in the cache, essentially increasing the effective cache size.
It tends to fit more useful data in the same cache line, increasing the likelihood that requested data can be found in the cache.
It reduces the memory bandwidth requirements, as there will be fewer fetches.
Common techniques are:
Use smaller data types
Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way; see the sketch after this list)
Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
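For instance, a sketch of the layout points above (field names invented; the byte counts assume a typical 64-bit ABI):

#include <cstdint>
#include <vector>

// Careless ordering leaves alignment holes.
struct ParticleLoose {
    bool    alive;      // 1 byte + 7 bytes of padding before the doubles
    double  x, y, z;    // 24 bytes
    uint8_t flags;      // 1 byte + 7 bytes of tail padding
};                      // typically 40 bytes

// Sorted by decreasing size: same fields, typically 32 bytes.
struct ParticlePacked {
    double  x, y, z;
    bool    alive;
    uint8_t flags;
};

// Hot/cold split: the fields the inner loop touches live in one array and the
// rarely used ones in another, so hot cache lines carry only hot data.
struct ParticleHot  { double x, y, z; };
struct ParticleCold { bool alive; uint8_t flags; /* names, debug info, ... */ };
std::vector<ParticleHot>  hot;
std::vector<ParticleCold> cold;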
We should also note that there are other ways to hide memory latency than using caches.
Modern CPUs often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
By regrouping instructions so that those that always miss in the cache occur close to each other, the CPU can sometimes overlap their fetches, so that the application only sustains one latency hit (memory-level parallelism).
To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.
Merging loops that touch the same data (loop fusion) and employing rewriting techniques known as tiling or blocking both strive to avoid those extra memory fetches.
While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.
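As a small illustration of loop fusion (a sketch only; the arrays are hypothetical, and the dependence analysis mentioned above still has to be done by hand):

#include <cstddef>
#include <vector>

// Two separate passes: a[] is pulled through the cache twice.
void separate(std::vector<float>& a, const std::vector<float>& b)
{
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0f;
}

// Fused loop: each element is touched once while it is still in cache.
void fused(std::vector<float>& a, const std::vector<float>& b)
{
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = (a[i] + b[i]) * 2.0f;
}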
These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.
I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        do_something(ary[j][i]);    // note the [j][i] indexing
The reason this is cache-inefficient is that modern CPUs load an entire cache line of "nearby" memory addresses from main memory when you access a single address. Because j varies in the inner loop but is used as the first (row) subscript, each inner-loop iteration jumps to a different row, so a new cache line of addresses near the [j][i] entry has to be loaded every time. If this is changed to the equivalent:
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        do_something(ary[i][j]);    // [i][j]: sequential within a row
It will run much faster.
The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.
The cache works on two principles: Temporal locality and spatial locality.
The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.
The cache tries to accommodate this by remembering the most recently used chunks of data. It operates on cache lines, typically 64 or 128 bytes, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it will already be in the cache.
And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.
A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.
The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.
Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.
In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.
And if you want to exploit multiple cores, it can get really interesting, as usually only one core may hold a given cache line for writing at a time. So if both cores constantly write to the same address, the line bounces back and forth between them, resulting in constant cache misses as they fight over it.
I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.
Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).
Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.
This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.
Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
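The same trade-off exists outside Java; a C++ sketch of the idea (the array size and helper names are illustrative only):

#include <cstddef>
#include <vector>

// One byte per flag: trivial indexing, but 8x the memory traffic.
std::vector<unsigned char> flags_bytes(1000000, 0);

// One bit per flag: extra shift/mask work per access, but 1/8 of the data,
// so far more of it stays in cache when the array is large.
std::vector<unsigned char> flags_bits(1000000 / 8, 0);

inline bool get_bit(const std::vector<unsigned char>& bits, std::size_t i)
{
    return (bits[i / 8] >> (i % 8)) & 1u;
}

inline void set_bit(std::vector<unsigned char>& bits, std::size_t i, bool v)
{
    if (v) bits[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
    else   bits[i / 8] &= static_cast<unsigned char>(~(1u << (i % 8)));
}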
The most effective data structure for a cache is an array. Caches work best, if your data structure is laid out sequentially as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.
Any algorithm which accesses memory in random order thrashes the caches because it always needs new cache lines to accommodate the randomly accessed memory. On the other hand, an algorithm which runs sequentially through an array is best because:
It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.
Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.
One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.
So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.
In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.
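A rough sketch of that layout change (the names are invented; a real engine would also align the arrays for SSE):

#include <cstddef>
#include <vector>

// Array-of-structs: the physics loop drags render handles, AI state, etc.
// through the cache even though it only needs positions and velocities.
struct GameObjectAoS {
    float px, py, pz;
    float vx, vy, vz;
    float mass;
    // ... render handles, AI state, names, ...
};

// Struct-of-arrays: the physics loop streams through exactly the data it
// uses, indexed by object ID rather than by pointer.
struct PhysicsSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;

    void integrate(float dt)
    {
        for (std::size_t i = 0; i < px.size(); ++i) {
            px[i] += vx[i] * dt;
            py[i] += vy[i] * dt;
            pz[i] += vz[i] * dt;
        }
    }
};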
A remark on the "classic example" by user 1800 INFORMATION (too long for a comment):
I wanted to check the time difference between the two iteration orders ("outer" and "inner"), so I ran a simple experiment with a large 2D array:
measure::start();
for ( int y = 0; y < N; ++y )
for ( int x = 0; x < N; ++x )
sum += A[ x + y*N ];
measure::stop();
and the second case with the for loops swapped.
The slower version (with the loops swapped, so that x varies in the outer loop) took 0.88 sec and the faster one, shown above, took 0.06 sec. That's the power of caching :)
I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by itselves" does not hold
Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
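A minimal sketch of the padding fix (64 bytes is an assumed cache-line size; where available, C++17's std::hardware_destructive_interference_size reports the real value):

#include <atomic>

// Bad: both counters share one cache line, so two threads incrementing
// "their own" counter still bounce the line between cores.
struct SharedCounters {
    std::atomic<long> a;
    std::atomic<long> b;
};

// Better: align and pad each counter onto its own 64-byte cache line.
struct PaddedCounters {
    alignas(64) std::atomic<long> a;
    alignas(64) std::atomic<long> b;
};

static_assert(sizeof(PaddedCounters) >= 128, "each counter gets its own line");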
I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.
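For instance (a trivial sketch; the size of the difference depends heavily on element count and allocator behaviour):

#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: neighbouring elements share cache lines and the
// traversal is a sequential sweep.
long sum_vector(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0L);
}

// Node-based storage: each element is a separate allocation reached through
// a pointer, so the traversal tends to miss the cache on every node.
long sum_list(const std::list<int>& l)
{
    return std::accumulate(l.begin(), l.end(), 0L);
}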
Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.
Data structures that are contained within a single cache-line are therefore more efficient.
Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.
Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
To ask how to make code cache-effective/cache-friendly, like most of the other questions here, is usually to ask how to optimize a program; that's because the cache has such a huge impact on performance that any optimized program is one that is cache-effective/cache-friendly.
I suggest reading about Optimization, there are some good answers on this site.
In terms of books, I recommend Computer Systems: A Programmer's Perspective, which has some fine material on the proper usage of the cache.
(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)
There have been a lot of answers with general advice on data structure selection, access patterns, etc. Here I would like to add another code design pattern, called software pipelining, that makes use of active cache management.
The idea is borrowed from other pipelining techniques, e.g. CPU instruction pipelining.
This type of pattern best applies to procedures that
can be broken down into multiple reasonable sub-steps, S[1], S[2], S[3], ..., whose execution time is roughly comparable with the RAM access time (~60-70 ns), and
take a batch of inputs and perform the aforementioned steps on each of them to get results.
Let's take a simple case where there is only one sub-procedure.
Normally the code would look like:
def proc(input):
    return sub_step(input)
To have better performance, you might want to pass multiple inputs to the function in a batch so you amortize function call overhead and also increases code cache locality.
def batch_proc(inputs):
    results = []
    for i in inputs:
        # avoids instruction cache misses, but still suffers data (inputs) misses
        results.append(sub_step(i))
    return results
However, as said earlier, if the execution of the step is roughly the same as RAM access time you can further improve the code to something like this:
def batch_pipelined_proc(inputs):
    results = []
    for i in range(0, len(inputs) - 1):
        prefetch(inputs[i + 1])
        # work on the current item while inputs[i+1] is flying back from RAM
        results.append(sub_step(inputs[i]))
    results.append(sub_step(inputs[-1]))
    return results
The execution flow would look like:
prefetch(1) asks the CPU to prefetch inputs[1] into the cache; the prefetch instruction itself takes P cycles and returns, while in the background inputs[1] arrives in the cache after R cycles.
works_on(0) takes a cold miss on inputs[0] and works on it, which takes M cycles.
prefetch(2) issues another fetch.
works_on(1): if P + R <= M, then inputs[1] should already be in the cache before this step, thus avoiding a data cache miss.
works_on(2) ...
There could be more steps involved; you can then design a multi-stage pipeline as long as the timing of the steps and the memory access latency match, and you will suffer few code/data cache misses. However, this process needs to be tuned with many experiments to find the right grouping of steps and prefetch distance. Because of the effort required, it sees more adoption in high-performance data/packet stream processing. A good production code example can be found in the DPDK QoS Enqueue pipeline design:
http://dpdk.org/doc/guides/prog_guide/qos_framework.html Chapter 21.2.4.3. Enqueue Pipeline.
More information could be found:
https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and
http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf
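For reference, the pipelined loop above can be written in C++ with GCC/Clang's __builtin_prefetch; Item and sub_step are placeholders, and the prefetch distance of one item is taken straight from the pseudocode rather than tuned:

#include <cstddef>
#include <vector>

struct Item { int value; /* ... */ };
int sub_step(const Item& it) { return it.value * 2; }   // stand-in for the real work

std::vector<int> batch_pipelined_proc(const std::vector<Item>& inputs)
{
    std::vector<int> results;
    results.reserve(inputs.size());
    for (std::size_t i = 0; i + 1 < inputs.size(); ++i) {
        __builtin_prefetch(&inputs[i + 1]);      // start pulling the next item into cache
        results.push_back(sub_step(inputs[i]));  // work while that prefetch is in flight
    }
    if (!inputs.empty())
        results.push_back(sub_step(inputs.back()));
    return results;
}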
Besides aligning your structure and fields, if your structure is heap-allocated you may want to use allocators that support aligned allocations, like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may have random false sharing. Remember that in Windows, the default heap has 16-byte alignment.
Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations with GCC: it produces a larger binary. Often, -Os is just as good as -O2. It all depends on the processor used, though. YMMV.
Work with small chunks of data at a time. That is why a less efficient sorting algorithm can run faster than quicksort if the data set is large. Find ways to break your larger data sets into smaller ones. Others have suggested this.
In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted into assembly. For example:
for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)
The two loops produce different code even though they merely traverse an array. In any case, your question is very architecture-specific, so your only way to tightly control cache use is to understand how the hardware works and optimise your code for it.
By deterministic I vaguely mean that can be used in critical real-time software like aerospace flight software. Garbage collectors (and dynamic memory allocation for that matter) are big no-no's in flight software because they are considered non-deterministic. However, I know there's ongoing research on this, so I wonder if this problem has been solved yet.
I'm also including in the question any garbage collection algorithms that put restrictions on how they're used.
I know I might get a lot of down-votes for this reply, but if you are already trying to avoid dynamic memory in the first place, because you said it's a no-no, why use GC at all? I'd never use GC in a real-time system where predictable runtime speed is the major concern. I'd avoid dynamic memory wherever possible, so there are very, very few dynamic objects to start with, and then I'd handle the few dynamic allocations I do have manually, so I have 100% control over when something is released and where it is released. After all, it's not just GC that is non-deterministic; free() is as non-deterministic as malloc() is. Nobody says that a free() call just has to mark the memory as free. It might as well try to combine smaller free memory blocks surrounding the freed one into a bigger one, and this behavior is not deterministic, nor is its runtime (sometimes free won't do that and malloc will do it instead on the next allocation, but nowhere is it written that free mustn't do that).
In a critical realtime system, you might even replace the system's standard malloc()/free() with a different implementation, maybe even writing your own (it's not as hard as it sounds! I've done it before just for the fun of it) that works more deterministically. For me, GC is a plain convenience thing; it exists to get programmers away from focusing on sophisticated malloc()/free() planning and instead have the system deal with this automatically. It helps with rapid software development and saves hours of debugging spent finding and fixing memory leaks. But just as I'd never use GC within an operating system kernel, I'd never use it within a critical realtime application either.
If I needed more sophisticated memory handling, I'd maybe write my own malloc()/free() that works as desired (and as deterministically as possible) and write my own reference-counting model on top of it. Reference counting is still manual memory management, but much more comfortable than bare malloc()/free(). It is not ultra fast, but it is deterministic (at least increasing/decreasing the ref counter is deterministic in speed), and unless you have circular references, it will catch all dead memory if you follow a retain/release strategy throughout your application. The only non-deterministic part is that you won't know whether calling release will just decrease the ref counter or really free the object (depending on whether the ref count goes to zero or not). But you could delay the actual free by offering a function "releaseWithoutFreeing", which decreases the ref counter by one but, even if it reaches zero, won't free() the object yet. Your malloc()/free() implementation can then have a function "findDeadObjects" that searches for all objects with a retain counter of zero that have not yet been freed, and frees them (at a later point, when you are in a less critical part of your code that has more time for such tasks). Since this is also non-deterministic, you could limit the amount of time it may use, as in "findDeadObjectsForUpTo(ms)", where ms is the number of milliseconds it may spend finding and freeing them, returning as soon as this time quantum has been used up, so you won't spend too much time on this task.
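A sketch of one way such an API could be realized (releaseWithoutFreeing and findDeadObjectsForUpTo are the hypothetical names from the description above, not an existing library, and a real implementation would sit on the custom allocator rather than on new/delete):

#include <chrono>
#include <vector>

struct Obj {
    int refs = 1;
    // ... payload ...
};

static std::vector<Obj*> pending;   // ref count hit zero, but not yet freed

void releaseWithoutFreeing(Obj* o)
{
    if (--o->refs == 0)
        pending.push_back(o);       // defer the actual free
}

// Free dead objects, but stop once the time budget is spent, so the
// non-deterministic part runs only in the less critical parts of the code.
void findDeadObjectsForUpTo(std::chrono::milliseconds budget)
{
    const auto deadline = std::chrono::steady_clock::now() + budget;
    while (!pending.empty() && std::chrono::steady_clock::now() < deadline) {
        delete pending.back();
        pending.pop_back();
    }
}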
Metronome GC and BEA JRockit are two deterministic GC implementations that I'm aware of (both for Java).
Happened to be searching through Stack Overflow and noticed this rather old post. Jon Anderson mentioned JamaicaVM. Since these posts have been up for over 4 years now, I think it's important to respond to some of the information posted here. I work for aicas, the developers and marketers of JamaicaVM, JamaicaCAR, and Veriflux.
JamaicaVM does have a hard realtime garbage collector. It is fully preemptive -- the exact same behavior required in a realtime operating system. Although the preemption latency is CPU-speed dependent, assume that on a GHz-class processor preemption of the garbage collector takes less than 1 microsecond. There is a 32-bit single-core version that supports up to 3 GB of memory per process address space. There is a 32-bit multicore version that supports 3 GB of memory per process address space and multiple cores. There are also 64-bit single-core and multicore versions that support up to 128 GB of memory per process address space. The performance of the garbage collector is independent of the size of memory. In response to one of the answers regarding running the GC completely out of memory: for a hard realtime system you would not design your program to ever do that. Although you can, in fact, use a hard realtime GC in this scenario, you would have to account for a worst-case execution time that probably would not be acceptable to your application.
Instead, the correct approach would be to analyze your program for maximum memory allocation, and then configure the hard realtime garbage collector to incrementally free blocks during all previous allocations so that the specific scenario described never occurs. This is known as thread-distributed, work-paced garbage collection.
Dr. Siebert's book on Hard Realtime Garbage Collectors describes how to accomplish this and presents a formal proof that the garbage collector will keep up with the application, while not becoming an O(N) operation.
It is very important to understand that realtime garbage collection means several things:
The garbage collector is preemptible, just like any other operating system service
It can be proven, mathematically that the garbage collector will keep up, such that memory will not be exhausted because some memory has not been reclaimed yet.
The garbage collector does not fragment memory, such that as long as there is memory available, a memory request will succeed.
Additionally, you will need this to be part of a system with priority inversion protection, a fixed priority thread scheduler and other features. Refer to the RTSJ for some information on this.
Although hard realtime garbage collection is needed for safety-critical applications, it can be used in mission-critical and general-purpose Java applications as well. There are no inherent limitations to using a hard realtime garbage collector. For general use, you can expect smoother program execution, since there are no long garbage collector pauses.
To me, 100% real-time Java is still very much a hit-and-miss technology, but I don't claim to be an expert.
I'd recommend reading these articles from the Cliff Click blog. He's the architect of Azul and has pretty much coded all of the standard 1.5 Java concurrent classes, etc... FYI, Azul is designed for systems which require very large heap sizes, rather than just standard RT requirements.
It's not GC, but there are simple O(1) fixed sized block allocation/free schemes you can use for simple usage. For example, you can use a free list of fixed sized blocks.
struct Block {
    struct Block *next;
};

struct Block *free_list = NULL; /* you will need to populate this at start; an
                                 * easy way is to just call release() on each
                                 * block you want to add */
void release(void *p) {
if(p != NULL) {
struct Block *b_ptr = (struct Block *)p;
b_ptr->next = free_list;
free_list = b_ptr;
}
}
void *acquire() {
void *ret = (void *)free_list;
if(free_list != NULL) {
free_list = free_list->next;
}
return ret;
}
/* call this before you use acquire/release */
void init() {
/* example of an allocator supporting 100 blocks each 32-bytes big */
static const int blocks = 100;
static const int size = 32;
static unsigned char mem[blocks * size];
int i;
for(i = 0; i < blocks; ++i) {
release(&mem[i * size]);
}
}
If you plan accordingly, you could limit your design to only a few specific sizes for dynamic allocation and have a free_list for each potential size. If you are using C++, you can implement something simple like scoped_ptr (for each size, I'd use a template parameter) to get simpler, yet still O(1), memory management.
The only real caveat is that you will have no protection against double frees, or against accidentally passing release() a pointer which didn't come from acquire().
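In C++, a minimal RAII wrapper along the lines of the scoped_ptr suggestion might look like this (a sketch only; it assumes the acquire()/release() functions above are visible to the C++ code):

// Owns one fixed-size block for the lifetime of the current scope, so the
// block goes back on the free list even on early returns or exceptions.
class ScopedBlock {
public:
    ScopedBlock()  : p_(acquire()) {}
    ~ScopedBlock() { release(p_); }

    ScopedBlock(const ScopedBlock&)            = delete;  // prevent double release
    ScopedBlock& operator=(const ScopedBlock&) = delete;

    void* get() const { return p_; }   // NULL if the pool was exhausted

private:
    void* p_;
};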
Sun has extensively documented their real-time garbage collector, and provided benchmarks you can run for yourself here. Others mentioned Metronome, which is the other major production-grade RTGC algorithm. Many other vendors of RT JVMs have their own implementations -- see my list of vendors over here and most of them provide extensive documentation.
If your interest is particularly in avionics/flight software, I suggest you take a look at aicas, an RTSJ vendor who specifically markets to the avionics industry. Dr. Siebert's (aicas CEO) home page lists some academic publications that go into great detail about PERC's GC implementation.
You may have some luck with the following PhD thesis
CMU-CS-01-174 - Scalable Real-time Parallel Garbage Collection for Symmetric Multiprocessors.
Real-time means a guaranteed upper bound on response time. This means an upper bound on the number of instructions that you can execute until you deliver the result. This also puts an upper limit on the amount of data you can touch. If you don't know how much memory you're going to need, it is extremely likely that you have a computation for which you cannot give an upper limit on its execution time. On the other hand, if you know the upper bound of your computation, you also know how much memory gets touched by it (unless you don't really know what your software does). So, the amount of knowledge you have about your code obviates the need for a GC.
There are features, like regions in RT-Java, that allow for expressiveness beyond local and global (static) variables. But they will not relieve you from your obligation to manage the memory you allocate, because otherwise you cannot guarantee that the next upcoming allocation will not fail because of insufficient memory resources.
Admittedly, I've gotten somewhat suspicious about things that call themselves "realtime garbage collectors". Of course, any GC is realtime if you assume that every allocation runs a full collection (which still doesn't guarantee that it will succeed afterwards, because all memory blocks might be found to be reachable). But for any GC that promises a bound on allocation time, consider its performance on the following example code:
// assume that one Link object needs k bytes:
class Link {
Link next = null;
/* further fields */
static Link head = null;
}
public static void main (String[] args) {
// assume we have N bytes free now
// set n := floor (N/k), assume that n > 1
for (int i = 0; i < n; i ++) {
Link tmp = new Link ();
tmp.next = Link.head;
Link.head = tmp;
}
// (1)
Link.head = Link.head.next; // (2)
Link tmp = new Link (); // (3)
}
At point (1), we have less than k bytes free (allocation of another Link object would fail), and all Link objects allocated so far are reachable starting from the static Link.head field.

At point (2), (a) what used to be the first entry in the list is now unreachable, but (b) it is still allocated, as far as the memory management part is concerned.

At point (3), the allocation should succeed because of (2a) -- we can reuse what used to be the first link -- but, because of (2b), we must start the GC, which will end up traversing n-1 objects, and hence has a running time of O(N).
So, yes, it's a contrived example. But a GC that claims to have a bound on allocation should be able to master this example as well.
I know this post is a bit dated, but I have done some interesting research and want to make sure this is updated.
Deterministic GC is offered by Azul Systems' "Zing JVM" and by JRockit. Zing comes with some very interesting added features and is now "100% software based" (it can run on x86 machines). It is only for Linux at this time, though...
Price:
If you are on Java 6 or earlier, Oracle is now charging a 300% uplift and forcing support for this capability ($15,000 per processor and $3,300 support). Azul, from what I have heard, is around $10,000 to $12,000, but charges per physical machine, not per core/processor. Also, the prices are graduated by volume, so the more servers you leverage, the deeper the discounting. My conversations with them showed them to be quite flexible. Oracle is a perpetual license and Zing is subscription based; do the math and add in the other features that Zing has (see the differences below).
You can cut costs by moving to Java 7, but then you incur development costs. Given Oracle's roadmap (a new release every 18 months or so), and the fact that they historically offer only the latest plus one older version of Java SE updates for free, the "free" horizon is expected to be 3 years from the initial GA release of any major version. Since initial GA releases are typically not adopted in production for 12-18 months, and moving production systems to new major Java releases typically carries major costs, this means that Java SE support bills will start hitting somewhere between 6 and 24 months after initial deployment.
Notable differences:
JRockit does still have some scalability limitations in terms of RAM (though improved from days of old). You can improve your results with a bit of tuning. Zing has engineered their algorithm to allow continuous, concurrent compaction (no stop-the-world pauses and no "tuning" required). This allows Zing to scale without a theoretical memory ceiling (they are doing 300+ GB heaps without suffering stop-the-world pauses or crashing). Talk about a paradigm changer (think of the implications for big data). Zing has some really cool improvements to locking, giving it amazing performance with a bit of work (if tuned, it can go sub-millisecond on average). Finally, they have visibility into classes, methods, and thread behavior in production (with no overhead). We are considering this a huge time saver when dealing with updates, patches, and bug fixes (e.g. leaks and locks). This can practically eliminate the need to recreate many issues in Dev/Test.
Links to JVM Data I found:
JRockit Deterministic GC
Azul Presentation - Java without Jitter
Azul / MyChannels Test
I know Azul Systems has a JVM whose GC is hardware-assisted. It can also run concurrently and collect massive amounts of data pretty fast.
Not sure how deterministic it is though.