Spectre fix impact on sorting performance

One of the most famous Stack Overflow questions asks why processing a sorted array is faster than processing an unsorted one; the answer is branch prediction.
Will the application of Intel's and Microsoft's Spectre fixes effectively nullify the answer given in that question on the affected processors (older-generation Intel processors, AMD Ryzen, and ARM)?

No, the key to Spectre is forcing mis-prediction of indirect branches, because they can jump to any address. It's non-trivial to find a sequence of instructions that loads secret data you want, and then makes another data-dependent load with the secret as an array index.
To attack a regular taken / not-taken conditional branch (like you'd find in a sort function, or that conditional in the loop over a sorted or not-sorted array), you'd need to find a case where executing the "wrong" side of a branch (maybe the wrong side of an if/else in the source) would do something useful when it runs with the "wrong" values in registers. It's plausible¹, but unlikely, so most defenses against Spectre will only worry about indirect branches.
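To make that concrete, the kind of gadget described above (a load of secret data behind a mistrained bounds check, followed by a second load indexed by that secret) looks roughly like the following sketch, in the style of the example from the Spectre paper; the names and sizes are illustrative placeholders:

#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
unsigned array1_size = 16;
uint8_t array2[256 * 512];
volatile uint8_t temp;

void victim(size_t x) {
    if (x < array1_size) {              // conditional branch the attacker mistrains
        uint8_t secret = array1[x];     // speculative out-of-bounds load of a secret byte
        temp &= array2[secret * 512];   // secret-dependent load leaves a cache footprint
    }                                   // that a later timing probe can recover
}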
Hardware fixes for Spectre will have to be more subtle than "turn off branch prediction" (i.e. stall the pipeline at every conditional branch). That would probably reduce performance by an order of magnitude in a lot of code, and is far too high to be an acceptable defense against a local information leak (which can lead to privilege escalation).
Even turning off prediction for only indirect branches (but not regular conditional branches) may be too expensive for most user-space code, because every shared library / DLL function call goes through an indirect branch in the normal software ecosystem on mainstream OSes (Linux, OS X, Windows).
The Linux kernel is experimenting with a retpoline to defeat indirect-branch prediction for indirect branches inside the kernel. I'm not sure it's enabled by default, though, even in kernels that enable the Meltdown workaround (KPTI).
Footnotes:
Sometimes the wrong case of a switch could do something totally inappropriate (e.g. in an interpreter), and if the switch was compiled with nested branches rather than a single indirect branch then you might be able to attack it. (Compilers often use a table of branch targets for switch, but when the cases are sparse it's not always possible. e.g. case 10 / case 100 / case 1000 / default would need a 990-entry array with only 3 used values.)

Related

Does it cost significant resources for a modern CPU to keep flags updated?

As I understand it, on a modern out of order CPU, one of the most expensive things is state, because that state has to be tracked in multiple versions, kept up-to-date across many instructions etc.
Some instruction sets like x86 and ARM make extensive use of flags, which were introduced when the cost model was not what it is today and the flags only cost a few logic gates: things like every arithmetic instruction setting flags to detect zero, carry and overflow.
Are these particularly expensive to keep updated on a modern out of order implementation? Such that e.g. an ADD instruction updates the carry flag, and this must be tracked because although it will probably never be used, it is possible that some other instruction could use it N instructions later, with no fixed upper bound on N?
Are integer operations like addition and subtraction cheaper on instruction set architectures like MIPS that do not have these flags?
Various aspects of this are not very publicly known, so I will try to separate definitely known things from reasonable guesses and conjecture.
An approach has been to extend the (physical) integer registers (whether they take the form of a physical register file [e.g. P4 and Sandy Bridge+] or of results-in-ROB [e.g. P3]) with the flags that were produced by the operation that also produced the associated integer result. That's only about the arithmetic flags (sometimes called AFLAGS, not to be confused with EFLAGS), but I don't think the "weird flags" are the focus of this question. Interestingly, there is a patent[1] that hints at storing more than just the 6 AFLAGS themselves, putting some "combination flags" in there as well, but who knows whether that was really done - most sources say the registers are extended by 6 bits, but AFAIK we (the public) don't really know.

Lumping the integer result and associated flags together is described in, for example, this patent[2], which is primarily about preventing a certain situation where the flags might accidentally no longer be backed by any physical register. Aside from such quirks, during normal operation it has the nice effect of only needing to allocate 1 register for an arithmetic operation, rather than a separate main-result and flags-result, so renaming is normally not made much worse by the existence of the flags.

Additionally, either the register alias table needs at least one more slot to keep track of which integer register contains the latest flags, or a separate flag-renaming-state buffer keeps track of the latest speculative flag state ([2] suggests Intel chose to separate them, which may simplify the main RAT, but they don't go into such details). More slots may be used[3] to efficiently implement instructions which only update a subset of the flags (NetBurst™ famously lacked this, resulting in the now-stale advice to favour add over inc). Similarly, the non-speculative architectural state (whether it would be part of the retirement register file or be separate-but-similar again is not clear) needs at least one such slot.
A separate issue is computing the flags in the first place. [1] suggests that separating flag generation from the main ALU simplifies the design. It's not clear to what degree they would be separated: the main ALU has to compute the Adjust and Sign flags anyway, and having an adder output a carry out of the top is not much to ask (less than recomputing it from nothing). The overflow flag only takes an extra XOR gate to combine the carry into the top bit with the carry out of the top bit. The Zero flag and Parity flag are not free though (and they depend on the result, not on the calculation of the result), so if there is partial separation it would make sense for those to be computed separately. Perhaps it really is all separate.

In NetBurst™, flag calculation took an extra half-cycle (the ALU was double-pumped and staggered)[4], but whether that means all flags are computed separately, or only a subset of them (or even a superset, as [1] hinted), is not clear - the flags result is treated as monolithic, so latency tests cannot distinguish whether a flag is computed in the third half-cycle by the flags unit or just handed to the flags unit by the ALU. In any case, typical ALU operations could be executed back-to-back even if dependent (meaning that the high half of the first operation and the low half of the second operation ran in parallel); the delayed computation of the flags did not stand in the way of that. As you might expect, though, ADC and SBB were not so efficient on NetBurst, but there may be other reasons for that too (for some reason a lot of µops are involved).
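As a small illustration of where the carry flag shows up as a real data dependency in ordinary code, consider this sketch (unsigned __int128 is a GCC/Clang extension, and that the compiler lowers it to an add/adc pair is the usual x86-64 outcome, not a guarantee):

// The low halves are added with add (which writes CF) and the high halves with
// adc (which reads CF), so the second instruction depends on the renamed flag
// result of the first, just as it would on an integer register.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b) {
    return a + b;
}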
Overall I would conclude that the existence of arithmetic flags costs significant engineering resources to prevent them from having a significant performance impact, but that effort is also effective, so a significant impact is avoided.

Does the guarantee of non-divergence when dispatching single work item exist?

As we know, work items running on GPUs can diverge when there are conditional branches. One mention of this is in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", with only 1 work item running. And when such a portion is especially serial and long-running, some applications move that work back to the CPU.
However, this question concerns only the GPU and assumes those portions are short-lived. Do these "single-threaded" portions also diverge (as in execute both the true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever term you prefer) skip those false branches?
Update
In reply to comment, I'd remove the OpenCL tag and leave the Vulkan tag there.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch I would expect most cases to only execute the active path because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, and then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than actually having a full branch.
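As a sketch of the second case, in OpenCL-C style (the kernel is made up, and whether a given compiler actually emits a select rather than a branch here is an assumption about its heuristics):

__kernel void blend(__global const float *in, __global float *out, float threshold) {
    size_t i = get_global_id(0);
    float v = in[i];
    // A small, cheap branch like this is a typical candidate for a conditional
    // select: both sides are computed and one result is picked, with no real branch.
    out[i] = (v > threshold) ? (v * 2.0f) : (v * 0.5f);
}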
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
Q : Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The hardware layer is where the code paths (translated into "machine" code) actually execute. On this layer, the SIMD compute units cannot and will not skip anything they are ordered to SIMD-process by the hardware scheduler (the next layer up).
The hardware-specific scheduler layer: GPUs typically have two modes, warp-mode scheduling for coherent, non-diverging code paths that can be scheduled efficiently in SIMD blocks, and a greedy-mode scheduling fallback. From this layer, the SIMD compute units are loaded with SIMD-wide blocks of work, so the first divergence detected on the layer below breaks the lock-step execution, flags the affected blocks to the SIMD hardware scheduler to be deferred and executed later, and the otherwise well-optimised SIMD block scheduling grows less and less efficient with each such run-time divergence.
The { OpenCL | Vulkan API }-mediated, device-specific programming layer decides a lot about the ease or comfort of human-side programming across the wide range of target devices, all without the programmer seeing the device's internal constraints, the compiler's preferred reformulation of the problem into "machine" code, or device-specific tricks and scheduling. In a somewhat oversimplified picture, for years the human user has simply stood "in front of" the mediated asynchronous HOST-to-DEVICE queues that schedule work units (kernels), waited for the DEVICE-to-HOST results to be delivered back, and done some prior H2D / posterior D2H memory transfers where allowed and needed.
The HOST-side directives for "scheduling" DEVICE kernel code are rather imperative and help the mediated device-specific programming reflect user-side preferences, yet they leave the user blind to all the internal decisions (assembly-level reviews are really only for hard-core, device-specific GPU-engineering aces, and are hard to act on even if one wanted to).
All that said, "adaptive" decisions, based on run-time values, to move a particular work unit back to the HOST CPU rather than finishing it on the DEVICE GPU do not, to the best of my knowledge, take place at the bottom of this complex computing-ecosystem hierarchy (AFAIK it would be prohibitively expensive to try to do so).

Are BOOST_LIKELY and __builtin_expect still relevant?

I understand what is explained here, and that these macros would include hints to the CPU for static branch prediction.
I was wondering how relevant these are on Intel CPUs now that Intel has dropped support for static prediction hints, as mentioned here. Also, if I understand how it works now, the number of branch instructions in the path is the only thing the compiler can control, and which branch path is predicted, fetched and decoded is decided at runtime.
Given this, are there any scenarios where branch hints in code are still useful for software targeting recent Intel processors, perhaps using a conditional return or reducing the number of branch instructions in the critical path in the case of nested if/else statements?
Also, if these are still relevant, any specifics on gcc and other popular compilers are appreciated.
P.S. I am not for premature optimization or for peppering the code with these macros, but I am interested in the topic as I am working with some time critical code and still like to reduce code clutter where possible.
Thanks
As the comments on your question correctly figure out:
There are no static branch prediction hints in the opcode map anymore on Intel x86 CPUs;
Dynamic branch prediction for "cold" conditional jumps tends to predict the fall-through path;
The compiler can use __builtin_expect to reorder which path of the if-then-else construct is placed as the fall-through case in the generated assembly (as sketched below).
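A minimal sketch of that layout effect, assuming gcc or clang at -O2; read_input, handle_error and consume are hypothetical helpers, and the exact placement of the cold path (e.g. into a .text.unlikely-style section) varies by compiler and version:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int read_input(int fd, char *buf, int len);   // hypothetical helper
int handle_error(int err);                    // hypothetical helper
int consume(const char *buf, int n);          // hypothetical helper

int process(int fd, char *buf, int len) {
    int n = read_input(fd, buf, len);
    if (unlikely(n < 0))
        return handle_error(n);   // hinted cold: moved off the fall-through path
    return consume(buf, n);       // hinted hot: kept on the fall-through path
}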
Now, consider a code base being compiled for multiple target architectures, not just Intel x86. A lot of them do have either static branch hints, dynamic branch predictors of different complexity, or both.
As an example, Intel Itanium architecture does offer an extensive system of prediction hints for all types of instructions: control flow, load/store etc. And Itanium was designed to have code being extensively optimized by a compiler with all these statically assigned instructions slots in a bundle and hints.
Therefore, __builtin_expect is still relevant for (rare) cases when 1) correct branch prediction information was too hard to deduce automatically by a compiler, and 2) the underlying hardware on at least one of target architectures was also known to be unable to reliably predict them dynamically. Given that certain low-power processors include primitive branch predictors that do not track branch history but always choose the fallthrough path, it starts to look beneficial. For modern Intel x86 hardware, not so much.

gcc likely unlikely macro usage

I am writing a critical piece of code with roughly the following logic
if (expression is true) {
    // do something with extremely low latency before the nuke blows up.
    // This branch is entered rarely, but it is the most important case.
} else {
    // do unimportant thing that doesn't really matter
}
I am thinking of using the likely() macro around the expression, so that when it hits the important branch, I get minimum latency.
My question is that this usage is really the opposite of what the macro name suggests, because I am picking the unlikely branch to be pre-fetched, i.e., the important branch is unlikely to happen, but it is the most critical thing when it does happen.
Is there a clear downside of doing this in terms of performance?
Yes. You are tricking the compiler by tagging the unlikely-but-must-be-fast branch as if it were the likely branch, in hopes that the compiler will make it faster.
There is a clear downside in doing that—if you don't write a good comment that explains what you're doing and why, some maintainer (possibly you yourself) in six months is almost guaranteed to say, "Hey, looks like he put the likely on the wrong branch" and "fix" it.
There is also a much less likely but still possible downside, that some version of some compiler that you use now or in the future will do different things than you're expecting with the likely macro, and those different things will not be what you wanted to trick the compiler into doing, and you'll end up with code that, every time through the loop, spends $100K speculatively getting 90% of the way through reactor shutdown before undoing it.
It's the exact opposite of the traditional use of __builtin_expect(x, 1), which is normally used via a macro along the lines of:
#define likely(x) __builtin_expect(x, 1)
which I would personally consider to be bad form (since you're cryptically marking the unlikely path as likely for a performance gain). However, you can still express this optimization: __builtin_expect(x, 1) itself makes no assumptions about your needs by claiming a path "likely" - that's just the standard use. To do what you want, I'd suggest:
#define optimize_path(x) __builtin_expect(x, 1)
which will do the same thing, but rather than making the code accuse the unlikely path of being likely, you're now making the code describe what you're really attempting -- to optimize the critical path.
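A usage sketch in the asker's structure (emergency_condition, handle_emergency and routine_work are hypothetical placeholders):

if (optimize_path(emergency_condition)) {
    handle_emergency();   // rare but latency-critical: laid out as the fall-through path
} else {
    routine_work();       // common but unimportant path
}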
However, I should say that if you're planning on timing a nuke, you should not only be hand-checking (and timing) the compiled assembly so that the timing is correct, but you should also be using an RTOS. A branch misprediction will have an extraordinarily insignificant effect, to the point that it's almost irrelevant here, since you can compensate for the "1 in a million" event by simply having a faster processor or correctly budgeting for the delay of a mispredict. What does affect modern computer timings is OS preemption and scheduling. If you need something to happen on a very precise timescale, you should be scheduling it for real time, not the pseudo-real-time that most general-purpose operating systems provide. Branch misprediction is generally hundreds of times smaller than the delay that can occur from not using an RTOS in a real-time situation.

Typically, if you believe branch misprediction might be a problem, you remove the branch from the time-sensitive code, since the branch predictor has complex state that is out of your control. Macros like likely and unlikely are for blocks of code that can be reached from various places, with various branch-prediction states, and most importantly are used very frequently. The high frequency of hitting these branches leads to a tangible increase in performance for applications that use them (like the Linux kernel). If you only hit the branch once, you might get a 1 nanosecond performance boost in some cases, but if an application is ever that time-critical, there are other things you can do that yield much larger performance increases.

Effects of branch prediction on performance?

When I'm writing some tight loop that needs to work fast I am often bothered by thoughts about how the processor branch prediction is going to behave. For instance I try my best to avoid having an if statement in the most inner loop, especially one with a result which is not somewhat uniform (say evaluates to true or false randomly).
I tend to do that because of the somewhat common knowledge that the processor pre-fetches instructions and if it turned out that it mis-predicted a branch then the pre-fetch is useless.
My question is - is this really an issue with modern processors? How good can branch prediction be expected to be?
What coding patterns can be used to make it better?
(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)
Branch prediction is pretty darned good these days. But that doesn't mean the penalty of branches can be eliminated.
In typical code, you probably get well over 99% correct predictions, and yet the performance hit can still be significant. There are several factors at play in this.
One is the simple branch latency. On a common PC CPU, that might be in the order of 12 cycles for a mispredict, or 1 cycle for a correctly predicted branch. For the sake of argument, let's assume that all your branches are correctly predicted, then you're home free, right? Not quite.
The simple existence of a branch inhibits a lot of optimizations.
The compiler is unable to reorder code efficiently across branches. Within a basic block (that is, a block of code that is executed sequentially, with no branches, one entry point and one exit), it can reorder instructions as it likes, as long as the meaning of the code is preserved, because they'll all be executed sooner or later. Across branches, it gets trickier. We could move these instructions down to execute after this branch, but then how do we guarantee they get executed? Put them in both branches? That's extra code size, that's messy too, and it doesn't scale if we want to reorder across more than one branch.
Branches can still be expensive, even with the best branch prediction. Not just because of mispredicts, but because instruction scheduling becomes so much harder.
This also implies that rather than the number of branches, the important factor is how much code goes in the block between them. A branch on every other line is bad, but if you can get a dozen lines into a block between branches, it's probably possible to get those instructions scheduled reasonably well, so the branch won't restrict the CPU or compiler too much.
But in typical code, branches are essentially free, because there usually aren't that many branches clustered closely together in performance-critical code.
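To make the trade-off concrete, here is a sketch in the spirit of the famous sorted-array question; the function names are made up, and whether the ternary is actually lowered to a conditional move (cmov) rather than a branch depends on the compiler and target:

#include <stddef.h>

long sum_branchy(const int *a, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= threshold)     // unpredictable branch if the data is random
            sum += a[i];
    }
    return sum;
}

long sum_branchless(const int *a, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] >= threshold) ? a[i] : 0;   // often lowered to cmov / select
    return sum;
}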
"(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)"
Excellent. Then you can profile your application's performance, use gcc's branch hints to make a prediction and profile again, then use the hints to make the opposite prediction and profile again.
Now imagine theoretically a CPU that prefetches both branch paths. And for subsequent if statements in both paths, it will prefetch four paths, etc. The CPU doesn't magically grow four times the cache space, so it's going to prefetch a shorter portion of each path than it would do for a single path.
If you find half of your prefetches being wasted, losing say 5% of your CPU time, then you do want to look for a solution that doesn't branch.
If we're beyond the "early optimization" phase, then surely we're beyond the "I can measure that" phase as well? With the crazy complexities of modern CPU architecture, the only way to know for sure is to try it and measure. Surely there can't be that many circumstances where you will have a choice of two ways to implement something, one of which requires a branch and one which doesn't.
Not exactly an answer, but you can find here an applet that demonstrates the finite state machine often used for table-based branch prediction in modern microprocessors.
It illustrates the use of extra logic to generate a fast (but possibly wrong) estimate of the branch condition and target address.
The processor fetches and executes the predicted instructions at full speed, but needs to revert all intermediate results when the prediction turns out to have been wrong.
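For reference, the FSM such applets usually show is the classic 2-bit saturating counter indexed by branch address; a rough sketch follows (the table size and simple modulo indexing are illustrative assumptions, not how any particular CPU does it):

#include <stdint.h>

#define PHT_SIZE 1024
static uint8_t pht[PHT_SIZE];   // 2-bit counters: 0-1 predict not-taken, 2-3 predict taken

int predict_taken(uintptr_t branch_pc) {
    return pht[branch_pc % PHT_SIZE] >= 2;
}

void train(uintptr_t branch_pc, int taken) {
    uint8_t *c = &pht[branch_pc % PHT_SIZE];
    if (taken) { if (*c < 3) (*c)++; }   // saturate at "strongly taken"
    else       { if (*c > 0) (*c)--; }   // saturate at "strongly not-taken"
}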
Yes, branch prediction really can be a performance issue.
This question (currently the highest-voted question on StackOverflow) gives an example.
My answer is:
The reason AMD has been as fast as or faster than Intel at some points in the past is simply that they had better branch prediction.
If your code involves no branch prediction (meaning it has no branches), then it can be expected to run faster.
So, conclusion: avoid branches if they're not necessary. If they are, try to make it so that one branch is evaluated 95% of the time.
One thing I recently found (on a TI DSP) is that trying to avoid branches can sometimes generate more code than the branch prediction cost.
I had something like the following in a tight loop:
if (var >= limit) { otherVar = 0; }
I wanted to get rid of the potential branch, and tried changing it to:
otherVar *= (var<limit)&1;
But the 'optimization' generated twice as much assembly and was actually slower.

Resources