How loop unrolling can cause cache misses - caching

I have read (on Wikipedia) that loop unrolling can cause instruction cache misses, but I don't understand how. From my understanding, whether the loop is unrolled or not, it still executes the same instructions, just with less loop overhead in the unrolled case, so how does it affect the instruction cache?
I could not find a clear answer. There was an answer on another StackOverflow question, but it wasn't complete: How can a program's size increase the rate of cache misses?

Unrolling a loop (usually) makes the code larger, because the body of the loop is repeated in the compiled executable. In an ideal situation, the compiler can optimize away code that is shared between iterations, but this isn't always possible.
This increase in code size can force other code out of the instruction cache, reducing performance. If the body of the loop and the code it calls no longer fit into the cache, performance will be severely reduced.
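To make that concrete, here is a rough C++ sketch (my own illustration, not from the original answer; the function names are made up). The unrolled version executes fewer branches, but its body is emitted four times plus a cleanup loop, so it occupies noticeably more instruction-cache lines:

#include <cstddef>

float sum_rolled(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)   // one copy of the body + loop overhead
        s += a[i];
    return s;
}

float sum_unrolled(const float* a, std::size_t n) {
    float s = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {          // four copies of the body per iteration:
        s += a[i];                        // fewer branches executed, but roughly
        s += a[i + 1];                    // 4x the instructions emitted for the body
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; ++i)                    // cleanup loop adds even more code
        s += a[i];
    return s;
}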

Related

Most relevant performance indicators for C/C++

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but effective indicator, but I know some are more specialized and help in optimizing specific domains: cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.
Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
Applying brute force more efficiently, to do the same work in fewer cycles. (And/or more friendly to the rest of the program, with a smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Often this involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Using data structures / algorithms that can more efficiently solve your problem on real CPUs is somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Consider summing an array of float on modern x86-64: going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency, and 2-per-clock = 0.5c reciprocal throughput (https://agner.org/optimize/). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike integers.)
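As a hedged illustration of the multiple-accumulator idea (a sketch, not a tuned kernel; it assumes AVX is available, that n is a multiple of 32, and uses 4 accumulators where Skylake would want up to 8 to fully hide latency):

#include <immintrin.h>
#include <cstddef>

// Sum a float array with 4 independent AVX accumulators so the 4-cycle
// vaddps latency is overlapped instead of serialized through one register.
float sum_avx(const float* a, std::size_t n) {   // assumes n % 32 == 0
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    // Horizontal sum of the final vector (simple, not the fastest way).
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float s = 0.0f;
    for (int j = 0; j < 8; ++j) s += tmp[j];
    return s;
}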
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or whether it's efficiently chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm. e.g. What is the efficient way to count set bits at a position or lower? See also C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, resulting in eviction of other useful data from cache. So that kind of effect is possible. Or having a lot of branches somewhere can maybe occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit Ctrl-C in a debugger and look at the call stack a few times. If a certain function keeps showing up on the stack, it (or what it calls) is where the time is going.
Related: linux perf: how to interpret and find hotspots, especially Mike Dunlavey's answer there, which makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.
@PeterCordes' answers are always good. I can only add my own perspective, coming from about 40 years of optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary, that you can get rid of if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find. Suppose it takes 30% of the time. Then a random-time snapshot has a 30% chance of showing it to you, so 5-10 snapshots catch it at least once with probability 1 - 0.7^5 ≈ 83% to 1 - 0.7^10 ≈ 97%.
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is that no "bottleneck" can hide from it. That sets it apart from profilers, which summarize, and summaries can hide speedups from you.

Inlining and Instruction Cache Hit Rates and Thrashing

In this article, https://www.geeksforgeeks.org/inline-functions-cpp/, it states that the disadvantages of inlining are:
3) Too much inlining can also reduce your instruction cache hit rate, thus reducing the speed of instruction fetch from that of cache memory to that of primary memory.
How does inlining affect the instruction cache hit rate?
6) Inline functions might cause thrashing because inlining might increase size of the binary executable file. Thrashing in memory causes performance of computer to degrade.
How does inlining increase the size of the binary executable file? Is it just that it increases the code base length? Moreover, it is not clear to me why having a larger binary executable file would cause thrashing, as the two don't seem linked.
It is possible that the confusion about why inlining can hurt i-cache hit rate or cause thrashing lies in the difference between static instruction count and dynamic instruction count. Inlining (almost always) reduces the latter but often increases the former.
Let us briefly examine those concepts.
Static Instruction Count
Static instruction count for some execution trace is the number of unique instructions0 that appear in the binary image. Basically, you just count the instruction lines in an assembly dump. The following snippet of x86 code has a static instruction count of 5 (the .top: line is a label which doesn't translate to anything in the binary):
mov ecx, 10
mov eax, 0
.top:
add eax, ecx
dec ecx
jnz .top
The static instruction count is mostly important for binary size, and caching considerations.
Static instruction count may also be referred to simply as "code size" and I'll sometimes use that term below.
Dynamic Instruction Count
The dynamic instruction count, on the other hand, depends on the actual runtime behavior and is the number of instructions executed. The same static instruction can be counted multiple times due to loops and other branches, and some instructions included in the static count may never execute at all and so don't count in the dynamic case. The snippet above has a dynamic instruction count of 2 + 3 * 10 = 32: the first two instructions are executed once, and then the loop executes 10 times with 3 instructions per iteration.
As a very rough approximation, dynamic instruction count is primarily important for runtime performance.
The Tradeoff
Many optimizations such as loop unrolling, function cloning, vectorization and so on increase code size (static instruction count) in order to improve runtime performance (often strongly correlated with dynamic instruction count).
Inlining is also such an optimization, although with the twist that for some call sites inlining reduces both dynamic and static instruction count.
How does inlining affect the instruction cache hit rate?
The article mentioned too much inlining, and the basic idea here is that a lot of inlining increases the code footprint by increasing the working set's static instruction count while usually reducing its dynamic instruction count. Since a typical instruction cache1 caches static instructions, a larger static footprint means increased cache pressure and often results in a worse cache hit rate.
The increased static instruction count occurs because inlining essentially duplicates the function body at each call site. So rather than one copy of the function body and a few instructions to call the function N times, you end up with N copies of the function body.
Now this is a rather naive model of how inlining works, since after inlining further optimizations may be possible in the context of a particular call site, which can dramatically reduce the size of the inlined code. In the case of very small inlined functions or a large amount of subsequent optimization, the resulting code may even be smaller after inlining, since the remaining code (if any) may be smaller than the overhead involved in calling the function2.
Still, the basic idea remains: too much inlining can bloat the code in the binary image.
What matters for the i-cache is the static instruction count of the code that executes, or more specifically the number of instruction cache lines touched in the binary image, which is largely a direct function of the static instruction count. That is, the i-cache caches regions of the binary image, so the more regions there are and the larger they are, the larger the cache footprint, even if the dynamic instruction count happens to be lower.
How does inlining increase the size of the binary executable file?
It's exactly the same principle as the i-cache case above: a larger static footprint means that more distinct pages need to be paged in, potentially putting more pressure on the VM system. Now, we usually measure code sizes in megabytes, while memory on servers, desktops, etc. is usually measured in gigabytes, so it's highly unlikely that excessive inlining will meaningfully contribute to thrashing on such systems. It could perhaps be a concern on much smaller or embedded systems (although the latter often don't have an MMU at all).
0 Here unique refers, for example, to the IP of the instruction, not to the actual value of the encoded instruction. You might find inc eax in multiple places in the binary, but each are unique in this sense since they occur at a different location.
1 There are exceptions, such as some types of trace caches.
2 On x86, the necessary overhead is pretty much just the call instruction. Depending on the call site, there may also be other overhead, such as shuffling values into the correct registers to adhere to the ABI, and spilling caller-saved registers. More generally, there may be a large cost to a function call simply because the compiler has to reset many of its assumptions across a function call, such as the state of memory.
Let's say you have a function that's 100 instructions long and it takes 10 instructions to call it each time it's called.
That means 10 calls use up 100 + 10 * 10 = 200 instructions in the binary.
Now let's say it's inlined everywhere it's used. That uses up 100 * 10 = 1000 instructions in your binary.
So for point 3, this means it will take significantly more space in the instruction cache (different invocations of an inlined function are not 'shared' in the i-cache).
And for point 6, your total binary size is now bigger, and a bigger binary size can lead to thrashing.
If compilers inlined everything they could, most functions would be gigantic. (In the most extreme case, your whole program would be one gigantic main that calls only library functions, with everything else inlined into it.)
Imagine if everything was a macro instead of a function, so it fully expanded everywhere you used it. This is the source-level version of inlining.
Most functions have multiple call-sites. The code-size to call a function scales a bit with the number of args, but is generally pretty small compared to a medium to large function. So inlining a large function at all of its call sites will increase total code size, reducing I-cache hit rates.
But these days its common practice to write lots of small wrapper / helper functions, especially in C++. The code for a stand-alone version of a small function is often not much bigger than code necessary to call it, especially when you include the side-effects of a function call (like clobbering registers). Inlining small functions can often save code size, especially when further optimizations become possible after inlining. (e.g. the function computes some of the same stuff that code outside the function also computes, so CSE is possible).
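A small hedged example of that effect (illustrative names, not from the original answer):

// A tiny helper whose stand-alone code is about the same size as a call to it,
// so inlining it usually shrinks code and enables further optimization.
struct Point { float x, y; };

static inline float len2(Point p) { return p.x * p.x + p.y * p.y; }

float scaled(Point p, float k) {
    // If len2 were an out-of-line function in another translation unit, the
    // compiler would have to emit two call sequences here (argument setup,
    // call, clobbered registers). After inlining, it can compute
    // p.x*p.x + p.y*p.y once (CSE) and fold it into the surrounding arithmetic.
    return k * len2(p) + len2(p);
}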
So for a compiler, the decision of whether to inline at any specific call site or not should be based on the size of the called function, and maybe on whether it's called inside a loop. (Optimizing away the call/ret overhead is more valuable if the call site runs more often.) Profile-guided optimization can help a compiler make better decisions, by "spending" more code-size on hot functions, and saving code-size in cold functions (e.g. many functions only run once over the lifetime of the program, while a few hot ones take most of the time).
If compilers didn't have good heuristics for when to inline, or you override them to be way too aggressive, then yes, I-cache misses would be the result.
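If you want to experiment with the heuristics yourself, GCC and Clang expose per-function overrides (a hedged example; the attributes are real, the helper functions are made up):

// Force or forbid inlining per function to measure the code-size / I-cache tradeoff.
__attribute__((always_inline)) inline int hot_helper(int x)  { return x * 3 + 1; }
__attribute__((noinline))             int cold_helper(int x) { return x * 3 + 1; }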
But modern compilers do have good inlining heuristics, and usually this makes programs significantly faster but only a bit larger. The article you read is talking about why there need to be limits.
The above code-size reasoning should make it obvious that executable size increases, because it doesn't shrink the data any. Many functions will still have a stand-alone copy in the executable as well as inlined (and optimized) copies at various call sites.
There are a few factors that mitigate the I-cache hit rate problem. Better locality (from not jumping around as much) lets code prefetch do a better job. And many programs spend most of their time in a small part of their total code, which usually still fits in I-cache after a bit of inlining.
But larger programs (like Firefox or GCC) have lots of code, and call the same functions from many call sites in large "hot" loops. Too much inlining bloating the total code size of each hot loop would hurt I-cache hit rates for them.
Thrashing in memory causes performance of computer to degrade.
https://en.wikipedia.org/wiki/Thrashing_(computer_science)
On modern computers with multiple GiB of RAM, thrashing of virtual memory (paging) is not plausible unless every program on the system was compiled with extremely aggressive inlining. These days most memory is taken up by data, not code (especially pixmaps in a computer running a GUI), so code would have to explode by a few orders of magnitude to start to make a real difference in overall memory pressure.
Thrashing the I-cache is pretty much the same thing as having lots of I-cache misses. But it would be possible to go beyond that into thrashing the larger unified caches (L2 and L3) that cache code + data.
Generally speaking, inlining tends to increase the emitted code size due to call sites being replaced with larger pieces of code. Consequently, more memory space may be required to hold the code, which may cause thrashing. I'll discuss this in a little more detail.
How does inlining affect the instruction cache hit rate?
The impact that inlining can have on performance is very difficult to statically characterize in general without actually running the code and measuring its performance.
Yes, inlining may impact the code size and typically makes the emitted native code larger. Let's consider the following cases:
The code executed within a particular period of time fits within a particular level of the memory hierarchy (say L1I) in both cases (with or without inlining). So performance with respect to that particular level will not change.
The code executed within a particular period of time fits within a particular level of the memory hierarchy in the case of no inlining, but doesn't fit with inlining. The impact this can have on performance depends on the locality of the executed code. Essentially, if the hottest pieces of code fit within that level of memory, then the miss ratio at that level might increase only slightly. Features of modern processors such as speculative execution, out-of-order execution, and prefetching can hide or reduce the penalty of the additional misses. It's important to note that inlining improves code locality, which can result in a net positive impact on performance despite the increased code size. This is particularly true when the code inlined at a call site is frequently executed. Partial inlining techniques have been developed to inline only the parts of a function that are deemed hot.
The code executed within a particular period of time doesn't fit within a particular level of the memory hierarchy in both cases. So performance with respect to that particular level will not change.
Moreover, it is not clear to me why having a larger binary executable file would cause thrashing, as the two don't seem linked.
Consider the main memory level on a resource-constrained system. Even a mere 5% increase in code size can cause thrashing at main memory, which can result in significant performance degradations. On other resource-rich systems (desktops, workstations, servers), thrashing usually only occurs at caches when the total size of the hot instructions is too large to fit within one or more of the caches.

Dynamically executing large volumes of execute-once, straight-line x86 code

Dynamically generating code is a pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly) or high-level, you can find libraries to help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, that doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below or simply because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine, however, that your use case is generating code that will execute exactly once, and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be tens of GBs or more. Clearly you don't want to write it all out to one giant buffer without any re-use: this would imply writing 10 GB to memory and perhaps also reading 10 GB (depending on how generation and execution are interleaved). Instead you'd probably want to use some reasonably sized buffer (say, one that fits in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code, and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way to generate a large amount of dynamically-generated straight-line code on x86 and execute it?
I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
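A minimal Linux sketch of that suggestion (my own illustration, not part of the original answer; error handling is omitted, and the chunk size, the memfd_create-based mapping, and the trivial mov eax, imm / ret stub are all assumptions for demonstration; it also assumes the OS allows a writable+executable mapping):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    const size_t kChunk = 256 * 1024;                        // code buffer, ~half of a 512 KiB L2
    int fd = (int)syscall(SYS_memfd_create, "codebuf", 0);   // anonymous file backing the code pages
    ftruncate(fd, kChunk);

    // Reserve a contiguous virtual range, then map the same physical pages
    // into successive windows of it, so writes and reads stay cache-hot.
    uint8_t* base = (uint8_t*)mmap(nullptr, kChunk * 4, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    for (int chunk = 0; chunk < 4; ++chunk) {
        uint8_t* window = base + chunk * kChunk;
        mmap(window, kChunk, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_SHARED | MAP_FIXED, fd, 0);                 // same physical pages, new virtual address

        // "Generate" code: mov eax, <chunk>; ret (x86-64). No explicit i-cache
        // flush instruction is needed on x86 after writing code in the same process.
        uint8_t stub[] = { 0xB8, (uint8_t)chunk, 0, 0, 0, 0xC3 };
        memcpy(window, stub, sizeof(stub));

        int r = ((int (*)())window)();                       // execute the freshly written chunk once
        printf("chunk %d returned %d\n", chunk, r);
    }
    return 0;
}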
I'm not sure you stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops: there's significant overhead in continually thrashing the instruction cache for that long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still avoid call instructions if you need to, for simplicity, even for tree-recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!

How come MATLAB runs slower and slower when running a program that takes a long time to execute?

There is a program that my MATLAB runs; since there are two gigantic nested for-loops, we expect this program to run for more than 10 hours. We ask MATLAB to print out the loop number each time it loops.
Initially (the first hour), we see the loop number increment very fast on our screen; as time goes by, it gets slower and slower. Now (more than 20 consecutive hours of executing the same ".m" file, and it still hasn't finished), it is almost 20 times slower than it was initially.
The RAM usage was initially about 30%; right now, after 20 hours of execution, it is as shown below:
My computer spec is below.
What can I do to let MATLAB maintain its initial speed?
I can only guess, but my bet is that you have some array variables that have not been preallocated, and thus their size increases at each iteration of the for loop. As a result of this, Matlab has to reallocate memory in each iteration. Reallocating slows things down, and more so the larger those variables are, because Matlab needs to find an ever larger chunk of contiguous memory. This would explain why the program seems to run slower as time passes.
If this is indeed the cause, the solution would be to preallocate those variables. If their size is not known beforehand, you can make a guess and preallocate to an approximate size, thus avoiding at least some of the reallocating. Or, if your program is not memory-limited, maybe you can use an upper bound on variable size when preallocating; then, after the loop, trim the arrays by removing unused entries.
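The cost described here is not MATLAB-specific. A rough C++ analogue (illustrative only; std::vector at least grows geometrically, while a MATLAB array grown one element at a time may reallocate far more often, so the effect there can be worse):

#include <cstddef>
#include <vector>

void grow_without_preallocation(std::vector<double>& v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(i * 0.5);   // occasional reallocation + copy of all existing elements
}

void grow_with_preallocation(std::vector<double>& v, std::size_t n) {
    v.reserve(n);               // one allocation up front, like zeros(n, 1) in MATLAB
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(i * 0.5);   // no further reallocations
}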
Some general hints; if they don't help, I suggest adding the code to the question.
Don't print to the console; this output slows down execution and is kept in memory. Write a logfile instead if you need it. For a simple status display, use waitbar.
Verify you are preallocating all variables.
Check which function calls depend on the loop index, for example variables whose size increases with the loop index.
If any mex-functions are used, double check them for memory leaks. The standard procedure to do this: call the function with random example data and don't store the output. If the memory usage increases, there is a memory leak in the function.
Use the profiler. Profile your code for the first n iterations (where n corresponds to about 10 minutes) and generate an HTML report. Then let it run for about 2 hours and generate a report for n iterations again. Now compare both reports and see where the time is lost.
I'd like to point everybody to the following page in the MATLAB documentation: Strategies for Efficient Use of Memory. This page contains a collection of techniques and good practices that MATLAB users should be aware of.
The OP reminded me of it when saying that memory usage tends to increase over time. Indeed there's an issue with long running instances of MATLAB on win32 systems, where a memory leak exists and is exacerbated over time (this is described in the link too).
I'd like to add to Luis' answer the following piece of advice that a friend of mine once received during a correspondence with Yair Altman:
Allocating largest vars first helps by assigning the largest free contiguous blocks to the highest vars, but it is easily shown that in some cases this can actually be harmful, so it is not a general solution.
The only sure way to solve memory fragmentation is by restarting Matlab and even better to restart Windows.
More information on memory allocation performance can be found in the following Undocumented Matlab posts:
Preallocation performance
Allocation performance take 2

How do you empty the cache when you measure a function's performance?

The CPU cache always interferes when we test the performance of some code.
gettime();
func1();
gettime();
gettime();
func2();
gettime();
// func2 appears faster because of the cache (or because func1() paid for the page faults).
// But this is often misinterpreted.
When you measure your code's performance, how do you remove the cache's influence?
I'm looking for functions or other ways to do this on Windows.
Please share your tips. Thanks.
One thing you can do is to call a function that has a lot of code and accesses a lot of memory in between calls to the item you are profiling. For example, in pseudo code (to be mostly language neutral):
// loop some number of times
{
//start timing
profile_func();
//stop timing
//add to total time
large_func(); // Uses lots of memory and has lots of code
}
// Compute the time of profile_func by dividing the total time by the number of iterations
The code in large_func() can be nonsense code, like some set of operations repeated over and over. The key is that it, or its code, does not get optimized out when you compile, so that it actually clears the instruction and data caches of the CPU (and the L2 and L3 caches, if present, as well).
This is a very important test for many cases. The reason it is important is that small fast functions that are often profiled in isolation can run very fast, taking advantage of CPU cache, inlining and enregistration. But, often times, in large applications these advantages are absent, because of the context in which these fast functions are called.
As an example, just profiling a function by running it for a million iterations in a tight loop might show that the function executes in say 50 nanoseconds. Then you run it using the framework I showed above, and all of a sudden its running time can drastically increase to microseconds, because it can no longer take advantage of the fact that it has the entire processor - its registers and caches, to itself.
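A hedged C++ sketch of what large_func() could look like (the 64 MiB buffer size is a guess intended to exceed a typical L3 cache; note this mainly evicts data caches, while fully clearing the instruction cache would also require executing a large amount of code):

#include <cstddef>
#include <vector>

// Touch a buffer larger than the last-level cache so the profiled function's
// data is no longer hot on the next timed iteration.
void large_func() {
    static std::vector<char> junk(64 * 1024 * 1024);     // 64 MiB, larger than a typical L3
    volatile char* p = junk.data();                      // volatile accesses can't be optimized away
    for (std::size_t i = 0; i < junk.size(); i += 64)    // one access per 64-byte cache line
        p[i] = (char)(p[i] + 1);                         // read-modify-write pulls each line into cache
}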
Good code takes advantage of cache, so you can't just turn it off (you can, but these results will be completely irrelevant).
What you need is to empty (or invalidate) the cache between successive tests. Here are some hints: Invalidating the CPU's cache
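One concrete way to do the invalidation on x86 is the SSE2 clflush intrinsic (a hedged sketch; it evicts only the addresses you pass it, so you need to know the working set, and the 64-byte line size is an assumption):

#include <emmintrin.h>   // _mm_clflush, _mm_mfence
#include <cstddef>

// Evict every cache line of a buffer from all cache levels before re-timing.
void flush_buffer(const void* buf, std::size_t bytes) {
    const char* c = static_cast<const char*>(buf);
    for (std::size_t i = 0; i < bytes; i += 64)   // assume 64-byte cache lines
        _mm_clflush(c + i);
    _mm_mfence();                                 // make sure the flushes complete before timing resumes
}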
