How to avoid mostly untaken conditional branches?

Consider the following situation:
You have a macro doing something very often throughout the whole code. (For example some Exception handling)
This macro will usually do very little, but periodically, certain circumstances arise such that the macro must do a lot more...
This could easily be implemented with a conditional branch that selects between the simple and the complex code, taking the branch only when the complex code is needed... BUT this may lead to the following grave performance issue:
Many modern branch predictors share prediction structures among multiple branches, so the history collected from other branches affects the prediction of every single branch! Thus, the overwhelming number of branches that are almost never taken may "confuse" the branch predictor, causing it to make horrible predictions for the branches that actually matter!
How can I get around this issue?
Note that, because the complex code is executed very, very seldom, I really don't care about efficiency in that case!
(A starting point for research may be: how do languages like Java get around this issue?)
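For concreteness, here is one common way to attack this: a minimal sketch assuming GCC or Clang, in which all of the names (UNLIKELY, CHECK, handle_rare_case, do_work) are hypothetical. The idea is to tell the compiler that the rare path is not expected to be taken and to move the heavy code out of line, so the hot path stays short and the biased branch stays cheap:

#include <stdio.h>

/* GCC/Clang builtin: hint that the condition is expected to be false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Keep the rare, heavy work out of the hot path; 'cold' also moves it
   to a separate text section so it does not pollute the I-cache. */
__attribute__((noinline, cold))
static void handle_rare_case(int err)
{
    fprintf(stderr, "rare case: %d\n", err);  /* complex handling; cost irrelevant */
}

#define CHECK(err)                    \
    do {                              \
        if (UNLIKELY((err) != 0))     \
            handle_rare_case(err);    \
    } while (0)

int do_work(int input)
{
    int err = (input < 0);   /* almost always 0 */
    CHECK(err);
    return input * 2;        /* the cheap, common path */
}

int main(void)
{
    return do_work(21) == 42 ? 0 : 1;
}

Note that __builtin_expect mainly influences code layout (the unlikely block is placed off the fall-through path); whether it also reduces pressure on shared predictor state depends on the microarchitecture, so treat this as a layout hint rather than a guaranteed fix.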

Related

gcc likely unlikely macro usage

I am writing a critical piece of code with roughly the following logic
if (expression is true) {
    // do something with extremely low latency before the nuke blows up.
    // This branch is entered rarely, but it is the most important case.
} else {
    // do an unimportant thing that doesn't really matter
}
I am thinking of using the likely() macro around the expression, so that when the important branch is hit, I get minimum latency.
My question is that this usage is really the opposite of what the macro name suggests, because I am picking the unlikely branch to be prefetched, i.e., the important branch is unlikely to happen, but it is the most critical thing when it does.
Is there a clear downside of doing this in terms of performance?
Yes. You are tricking the compiler by tagging the unlikely-but-must-be-fast branch as if it were the likely branch, in hopes that the compiler will make it faster.
There is a clear downside in doing that—if you don't write a good comment that explains what you're doing and why, some maintainer (possibly you yourself) in six months is almost guaranteed to say, "Hey, looks like he put the likely on the wrong branch" and "fix" it.
There is also a much less likely but still possible downside, that some version of some compiler that you use now or in the future will do different things than you're expecting with the likely macro, and those different things will not be what you wanted to trick the compiler into doing, and you'll end up with code that, every time through the loop, spends $100K speculatively getting 90% of the way through reactor shutdown before undoing it.
It's the exact opposite of the traditional use of __builtin_expect(x, 1), which is normally used in the sense of the macro:
#define likely(x) __builtin_expect(x, 1)
which I would personally consider to be bad form (since you're cryptically marking the unlikely path as likely for a performance gain). However, you can still express this optimization, since __builtin_expect itself makes no assumptions about your needs by claiming a path "likely" - that's just the conventional use. To do what you want, I'd suggest:
#define optimize_path(x) __builtin_expect(x, 1)
which will do the same thing, but rather than making the code accuse the unlikely path of being likely, you're now making the code describe what you're really attempting -- to optimize the critical path.
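For illustration only, here is a hedged sketch of how such a macro might be defined and used; the non-GCC fallback and the handler names (trigger_shutdown, log_heartbeat) are assumptions added for this example, not part of the suggestion above:

#if defined(__GNUC__) || defined(__clang__)
#define optimize_path(x) __builtin_expect(!!(x), 1)
#else
#define optimize_path(x) (x)   /* no hint available: plain condition */
#endif

static void trigger_shutdown(void) { /* rare but latency-critical work */ }
static void log_heartbeat(void)    { /* common, unimportant work */ }

void on_tick(int emergency)
{
    if (optimize_path(emergency)) {
        /* laid out as the fall-through path, so the critical case pays no taken-branch penalty */
        trigger_shutdown();
    } else {
        log_heartbeat();
    }
}

Pairing the macro with a comment that the "optimized" path is deliberately the rare one also addresses the maintenance concern mentioned earlier.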
However, I should say that if you're planning on timing a nuke, you should not only be hand-checking (and timing) the compiled assembly so that the timing is correct, but you should also be using an RTOS. A branch misprediction will have an extraordinarily insignificant effect, to the point that it's almost irrelevant here, since you can compensate for the "1 in a million" event by simply having a faster processor or by correctly budgeting for the misprediction delay. What does affect modern computer timings is OS preemption and scheduling. If you need something to happen on a very tight timescale, you should be scheduling it for real time, not the pseudo-real time that most general-purpose operating systems provide. Branch misprediction is generally hundreds of times smaller than the delay that can occur from not using an RTOS in a real-time situation.
Typically, if you believe branch misprediction might be a problem, you remove the branch from the time-sensitive code, since the branch predictor has state that is complex and out of your control. Macros like "likely" and "unlikely" are for blocks of code that can be hit from various areas, with various branch-prediction states, and most importantly are used very frequently. The high frequency of hitting these branches leads to a tangible increase in performance for applications that use them (like the Linux kernel). If you only hit the branch once, you might get a one-nanosecond performance boost in some cases, but if an application is that time-critical, there are other things you can do to gain much larger increases in performance.

Preventing performance regressions in R

What is a good workflow for detecting performance regressions in R packages? Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
What is a good workflow in general? What other languages provide good tools? Is it something that can be built on top of unit testing, or is it usually done separately?
This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.
What is a good workflow for detecting performance regressions in R packages?
In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.
My steps:
Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
Standardize the data set that will be used for speed testing.
Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, caching, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
Repeat tests multiple times.
Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc.
Examine the profiling output as rigorously as possible (see this question for some insights, also referencing tools from the OP).
Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
Unfortunately, I don't have an answer for this.
What is a good workflow in general?
For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.
Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.
What other languages provide good tools?
Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have good (and possibly familiar) examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be some serious problems. Matlab has a great static analyzer that can report, for instance, when objects (e.g. vectors, matrices) are growing inside of loops. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover something like this, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).
As far as language-agnostic methods, you can look at:
Valgrind & cachegrind
Monitoring of disk I/O, dirty buffers, etc.
Monitoring of RAM (Cachegrind is helpful, but you could just monitor RAM allocation, and lots of details about RAM usage)
Usage of multiple cores
Is it something that can be built on top of unit testing, or is it usually done separately?
This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be, within some statistical allowances for variation, the same, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.
So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:
Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower.
Caching at the level of the disk or the hardware (e.g. L1, L2, L3 caches), or at other levels (e.g. memoization). Often, the first execution will pay a penalty.
Dynamic voltage scaling - This one sucks. When there is a problem, it may be one of the hardest beasties to find, since it can go away quickly. It looks like caching, but it isn't.
Any job priority manager that you don't know about.
One method may use multiple cores or do some clever things about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard; another package may be very clever...
Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
The key result is: Ideally, how should we test for differences in expected values, subject to the randomness created due to order effects? Well, pretty simple: go back to experimental design. :)
When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.
The only way to do anything here is to make some assumptions. So let us assume an unchanged machine, or else require a 'recalibration'.
Then use a unit-test-like framework, and treat 'has to be done in X units of time' as just yet another testing criterion to be fulfilled. In other words, do something like
stopifnot( system.time(someExpression)[["elapsed"]] < savedValue + fudge )
so we would have to associate prior timings with given expressions. Equality-testing comparisons from any one of the three existing unit testing packages could be used as well.
Nothing that Hadley couldn't handle, so I think we can almost expect a new package timr after the next long academic break :). Of course, this either has to be optional, because on an "unknown" machine (think: CRAN testing the package) we have no reference point, or else the fudge factor has to "go to 11" to automatically accept on a new machine.
A recent change announced on the R-devel feed could give a crude measure for this.
CHANGES IN R-devel UTILITIES
‘R CMD check’ can optionally report timings on various parts of the check: this is controlled by environment variables documented in ‘Writing R Extensions’.
See http://developer.r-project.org/blosxom.cgi/R-devel/2011/12/13#n2011-12-13
The overall time spent running the tests could be checked and compared to previous values. Of course, adding new tests will increase the time, but dramatic performance regressions could still be seen, albeit manually.
This is not as fine grained as timing support within individual test suites, but it also does not depend on any one specific test suite.

How to refactor rapidly evolving code?

I have some research code that's a real rat's nest, with code duplication everywhere, and clearly needs to be refactored. However, the code base is evolving as I come up with new variations on the theme and fit them into the codebase. The reason I've put off refactoring so long is because I feel like the minute I spend a few days coming up with good abstractions, seeing what design patterns fit where, etc., I'll want to try out some new unforeseen idea that makes my abstractions completely inadequate. In other words, because of the rate at which the code is evolving, I really have no idea where abstraction lines belong, even though there is no shortage of (approximate) duplication and the general messiness of the code makes adding stuff to it a real pain. What are some general best practices for coping with this kind of situation?
Don't spend so long refactoring!
When you're about make a change in a piece of code, consider refactoring it to make the change easier.
After making the change, refactor again to clean up the damage done by that change.
In both cases, make the refactorings small and do them quickly, and move on.
You don't have to keep your code pristine at all times, but remember that it's easier to go fast if you have well-factored code to work in (and if you have good unit tests, of course).
Test Driven Development:
Red, Green, Refactor. Rinse, repeat.
Since it's one of the steps in every single cycle, you'll notice that's a LOT of usually minor refactoring taking place. That's the way it should be.
Your situation is pretty familiar to me. While doing investigative coding, you often have no idea what the "right" abstraction will be, and as you say, it can change with every new idea. Other posters have suggested:
Continuous small refactoring, which helps to avoid getting into the rats-nest situation
Test-Driven Development, which helps to find good, re-usable abstractions. It's important to note that TDD is less about testing than about doing good designs!
However, for investigative research code there is another strategy: the prototype. This seems to be what you are currently doing: coding as quickly as possible to prove a concept. There's nothing wrong with that, but a prototype should always be throw-away. Tweak it until you have all the necessary input and knowledge, then throw away the code and start over with TDD and continuous refactoring, and all your other "doing the things right" strategies.
Don't keep any of the code. Don't copy-paste anything. Don't refer back to it. Just start over with your new knowledge.
Clean up the code a little bit at a time. Whenever you touch a class, try to leave it cleaner than it was before you touched it ("the boy scout rule"). Refactoring is best done in very small steps, but very often.
Things like renaming a variable, splitting a method, etc. take only seconds or minutes. Large refactorings, such as splitting or joining classes, may take an hour or two (and you do them in small steps, so that all tests pass at least every five minutes - otherwise you have entered Refactoring Hell and you should revert to the last known working state). If it takes days or weeks to refactor something, then it's no longer "refactoring" - it's more like rewriting.
An article about this topic:
http://blog.objectmentor.com/articles/2007/07/20/whats-your-unit-of-measure
Put it in a distributed SCM like Git at least; that way, when you break something while refactoring, you can step back through the history (e.g. with git bisect) to find the commit prior to the change, as well as work on changes and commit them in branches without interfering with others' work.
Git's branch merging is great for things like this, and you'll easily know if two people made incompatible changes in parallel without having to worry about the rest of the code.
For the above reasons, I would also create a separate branch in the repository just for refactoring, and keep it updated regularly. That way, not only will others not interfere with your progress, but they can keep an eye on it and see changes that will eventually hit the main branch, so they can pre-emptively code around those changes.
If you already know where there is duplication, you don't need several days to refactor it away.
Sometimes a rewrite is the only choice. This seems to be the case.
The CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by language syntax. It supports Java, C#, COBOL, C++, PHP and many other languages.
When it shows a parameterized abstraction of a set of found clones, it is essentially proposing that you refactor the code with that abstraction implemented (as a method, a function, a class, ...).
So running the CloneDR gets a list of potential abstractions to be added to your code, and replacing the clone instances by calls on the abstraction refactors your code thus cleaning it up (somewhat).
Even more remarkably, when it shows the parameter bindings used at each clone site to invoke the abstraction, it often exposes a bungled clone instance, easily recognized when the bound parameters are conceptually inconsistent. If a parameter is bound to variables named YYYY-MM-DD, and one of them is YY-MM-DD, the "it's a 4-digit year" parameter type looks violated, and in this case there's a broken Y2K remediation. So examining the clone bindings often finds bugs.
This is a very common problem in scientific computing. Some of the most effective ideas for reducing the size and complexity of code require leveraging assumptions, and science demands that you constantly change those assumptions.
All you can do is try to refactor your code as you go, and try not to write yourself into any corners. Also work with good people who understand the value of not making a mess.

Code replication and refactoring

I would like to hear opinions on small amounts of code replication within methods that
check for the same condition
e.g.
while (condition) {
    ...... do x
}
Normally, if there were any of this kind of replication, I would refactor the code, as it can make versioning a nightmare: if the condition changes, for example, you have to change every instance, which is not a nice job to do.
However, what if the condition is relatively simple and is only used within, say, three methods: is it still wise to refactor?
So, in summary, where do people draw the line at refactoring code?
Refactoring is not free. Any change to the code can introduce bugs. So every conscientious developer considers whether they have time to carefully examine the changes, and in many cases decides not to refactor.
It depends on two things, IMO: the size of the duplication (how many lines are involved) and the locality of the duplications (how 'close' they are in terms of context).
If you have duplicated code in several methods in the same class, then I would consider extracting even duplicates of a single line into a separate method (assuming that the line in question is easily identifiable, easy to isolate, and fairly uncoupled).
Alternatively, if I had some code in one part of a project, and an almost identical piece of code in an (almost) unrelated area, then I wouldn't factor that out, as the scope for future divergence seems quite high...
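As a concrete sketch of that kind of extraction (shown here in C with entirely hypothetical names), the duplicated condition gets a name of its own, so a change to the rule only has to be made in one place:

#include <stdbool.h>

/* Hypothetical data type, purely for illustration. */
struct order {
    double total;
    double credit_limit;
    bool   approved;
};

/* The condition that used to be repeated in several methods, now named once. */
static bool order_needs_review(const struct order *o)
{
    return o->total > o->credit_limit && !o->approved;
}

void process_order(struct order *o)
{
    while (order_needs_review(o)) {
        /* ... do x: e.g., escalate until the order is cleared ... */
        o->approved = true;
    }
}

If the rule later diverges for one caller, the extraction can be undone or parameterized; the cost either way stays small, which is the point of keeping each refactoring step small.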
The key to refactoring is to do it when you need to. In the above example, if you have three different while loops with the same condition, who is to say you won't want different conditions in the future? If you've already refactored, then you've introduced a potential error situation.
It's a matter of judgement: the same condition three times seems OK, the same condition ten times is an obvious refactor, but where is the tipping point?
I am personally usually quite aggressive with refactorings. I believe that if you clean your code regularly you mostly need to do small and simple refactorings. If you leave it for a while, it gets so messy and difficult to maintain.
In your particular case, I would definitely refactor if the condition has a reasonable business meaning, because it will make the code more readable. But even if it is a technicality, I would consider refactoring, provided that it's just a matter of extracting a function or property.
Ideally you should have unit tests that will make sure your refactoring is still correct, so the cost of doing it should really be a few minutes, often less than it took to write this response.
A common rule of thumb is the Rule of Three described in Martin Fowler's seminal work Refactoring. The rule says that two things that are basically the same can stay, but once you add a third, you should refactor them.
Besides making future changes easier as you mention, refactoring helps with readability and can make intention more obvious.

Effects of branch prediction on performance?

When I'm writing a tight loop that needs to be fast, I am often bothered by thoughts about how the processor's branch prediction is going to behave. For instance, I try my best to avoid having an if statement in the innermost loop, especially one whose result is not somewhat uniform (say, one that evaluates to true or false randomly).
I tend to do that because of the somewhat common knowledge that the processor pre-fetches instructions and if it turned out that it mis-predicted a branch then the pre-fetch is useless.
My question is: is this really an issue with modern processors? How good can branch prediction be expected to be?
What coding patterns can be used to make it better?
(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)
Branch prediction is pretty darned good these days. But that doesn't mean the penalty of branches can be eliminated.
In typical code, you probably get well over 99% correct predictions, and yet the performance hit can still be significant. There are several factors at play in this.
One is the simple branch latency. On a common PC CPU, that might be on the order of 12 cycles for a mispredict, or 1 cycle for a correctly predicted branch. For the sake of argument, let's assume that all your branches are correctly predicted; then you're home free, right? Not quite.
The simple existence of a branch inhibits a lot of optimizations.
The compiler is unable to reorder code efficiently across branches. Within a basic block (that is, a block of code that is executed sequentially, with no branches, one entry point and one exit), it can reorder instructions as it likes, as long as the meaning of the code is preserved, because they'll all be executed sooner or later. Across branches, it gets trickier. We could move these instructions down to execute after this branch, but then how do we guarantee they get executed? Put them in both branches? That's extra code size, that's messy too, and it doesn't scale if we want to reorder across more than one branch.
Branches can still be expensive, even with the best branch prediction. Not just because of mispredicts, but because instruction scheduling becomes so much harder.
This also implies that rather than the number of branches, the important factor is how much code goes in the block between them. A branch on every other line is bad, but if you can get a dozen lines into a block between branches, it's probably possible to get those instructions scheduled reasonably well, so the branch won't restrict the CPU or compiler too much.
But in typical code, branches are essentially free, because there aren't that many branches clustered closely together in performance-critical code.
"(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)"
Excellent. Then you can profile your application's performance, use gcc's branch hints to make a prediction and profile again, then use the hints to make the opposite prediction and profile again.
Now imagine theoretically a CPU that prefetches both branch paths. And for subsequent if statements in both paths, it will prefetch four paths, etc. The CPU doesn't magically grow four times the cache space, so it's going to prefetch a shorter portion of each path than it would do for a single path.
If you find half of your prefetches being wasted, losing say 5% of your CPU time, then you do want to look for a solution that doesn't branch.
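If you do reach that point, one pattern worth trying - a sketch, not a guaranteed win, and you should always measure both forms - is to turn a data-dependent branch into straight-line arithmetic that the compiler can lower to a conditional move or a mask:

#include <stddef.h>
#include <stdint.h>

/* Branchy: the predictor must guess data[i] >= threshold on every iteration. */
int64_t sum_branchy(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Branchless: the comparison becomes a 0/1 value, so there is no control flow
   to mispredict inside the loop body. */
int64_t sum_branchless(const int32_t *data, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = (data[i] >= threshold);   /* 0 or 1 */
        sum += keep * data[i];
    }
    return sum;
}

On random data the branchless loop often wins; on sorted or otherwise predictable data the branchy version can be just as fast or faster, which is exactly why profiling both is the only reliable guide.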
If we're beyond the "early optimization" phase, then surely we're beyond the "I can measure that" phase as well? With the crazy complexities of modern CPU architecture, the only way to know for sure is to try it and measure. Surely there can't be that many circumstances where you will have a choice of two ways to implement something, one of which requires a branch and one which doesn't.
Not exactly an answer, but here you can find an applet that demonstrates the finite state machine often used for table-based branch prediction in modern microprocessors.
It illustrates the use of extra logic to generate a fast (but possibly wrong) estimate of the branch condition and target address.
The processor fetches and executes the predicted instructions at full speed, but needs to revert all intermediate results when the prediction turns out to have been wrong.
Yes, branch prediction really can be a performance issue.
This question (currently the highest-voted question on StackOverflow) gives an example.
My answer is:
The reason AMD has been as fast as or better than Intel at some points in the past is simply that they had better branch prediction.
If your code has no branches at all, then there is nothing to mispredict, and it can be expected to run faster.
So, in conclusion: avoid branches if they're not necessary. If they are necessary, try to make it so that one side of the branch is taken 95% of the time.
One thing I recently found (on a TI DSP) is that trying to avoid branches can sometimes generate more code than the branch prediction cost.
I had something like the following in a tight loop:
if (var >= limit) { otherVar = 0;}
I wanted to get rid of the potential branch, and tried changing it to:
otherVar *= (var<limit)&1;
But the 'optimization' generated twice as much assembly and was actually slower.
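The moral is to measure both forms on the actual target. As a rough, hedged sketch, here is how one might time the two variants from the snippet above on a general-purpose machine (a real DSP would use the vendor's cycle counters instead; the loop bound and the volatile qualifiers are only there to keep the compiler from deleting the work):

#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

int main(void)
{
    volatile int otherVar = 1, var = 0, limit = 50;

    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++) {
        var = (var + 1) % 100;
        if (var >= limit) { otherVar = 0; }      /* original branchy form */
    }

    clock_t t1 = clock();
    for (long i = 0; i < ITERS; i++) {
        var = (var + 1) % 100;
        otherVar *= (var < limit) & 1;           /* attempted branchless form */
    }
    clock_t t2 = clock();

    printf("branchy:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("branchless: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}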

Resources