I understand what is explained here, namely that these macros serve as hints to the CPU for static branch prediction.
I was wondering how relevant these are on Intel CPUs now that support for static prediction hints has been dropped, as mentioned here. If I understand how it works now, the number of branch instructions in the path is the only thing the compiler can control, while which branch path is predicted, fetched, and decoded is decided at runtime.
Given this, are there any scenarios where branch hints in code are still useful for software targeting recent Intel processors, perhaps by using a conditional return or by reducing the number of branch instructions in the critical path in the case of nested if/else statements?
Also, if these are still relevant, any specifics on gcc and other popular compilers would be appreciated.
P.S. I am not advocating premature optimization or peppering the code with these macros, but I am interested in the topic because I am working with some time-critical code and still like to reduce code clutter where possible.
Thanks
As you correctly figured out in the comments section for your question:
There are no static branch prediction hints in the opcode map anymore on Intel x86 CPUs;
Dynamic branch prediction for "cold" conditional jumps tends to predict the fall-through path;
The compiler can use __builtin_expect to control which path of an if-then-else construct will be placed as the fall-through case in the generated assembly.
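To make that last point concrete, here is a minimal sketch, assuming GCC or Clang; the likely/unlikely wrapper macros are a common convention I am using for illustration, not part of the language:

#include <cstdio>

// Common wrapper macros (not standard; the names are chosen for illustration).
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value) {
    if (unlikely(value < 0)) {
        // Told that this path is rare, the compiler will usually move it
        // out of line so the common path below stays on the fall-through.
        std::fprintf(stderr, "unexpected negative value\n");
        return -1;
    }
    // Expected path: laid out as the fall-through in the generated assembly.
    return value * 2;
}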
Now, consider a code base being compiled for multiple target architectures, not just Intel x86. Many of those architectures have static branch hints, dynamic branch predictors of varying complexity, or both.
As an example, the Intel Itanium architecture offers an extensive system of prediction hints for all types of instructions: control flow, load/store, etc. Itanium was designed to have its code extensively optimized by the compiler, with all of these statically assigned instruction slots in a bundle, plus hints.
Therefore, __builtin_expect is still relevant for the (rare) cases when 1) correct branch prediction information is too hard for the compiler to deduce automatically, and 2) the underlying hardware on at least one of the target architectures is known to be unable to predict the branches reliably at run time. Given that certain low-power processors include primitive branch predictors that do not track branch history but always choose the fall-through path, hinting starts to look beneficial there. For modern Intel x86 hardware, not so much.
How can I get the SICStus Prolog JIT to use any of the following instructions?
Intel BMI: POPCNT, LZCNT, TZCNT, PDEP, PEXT
Intel CLMUL: PCLMULQDQ
ARM AArch64: RBIT
I need them for supercharging clpz. Right now, I have:
http://www.hackersdelight.org/ and
the non-ISO arithmetic function msb/1.
For a start that's good, but I want more. Please help!
Unfortunately, there is no way for users to extend the JIT for cases like this.
I have been thinking about accessing the population count instructions (for some unrelated uses) from Prolog. The way to add this and other similar instructions would be:
1. Add a new arithmetic instruction to is/2. This needs to be supported by all our code, not just JIT-compiled code, so the interpreter, the WAM emulator, various internal byte-code processors, all the static analyzers in our IDE, etc.
2. Add JIT compilation that just calls back into the corresponding C routine in the runtime system (see the sketch below).
3. If it can be demonstrated to benefit performance sufficiently, make the JIT compiler emit the special-purpose CPU instructions on targets that have them.
(1) requires sufficient user demand (or explicit financing, of course). (3) requires convincing benchmarks. Currently neither of these is available, but that could change, of course.
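To illustrate step 2, here is a rough sketch of the kind of runtime-system routine the JIT could call back into; the name prolog_popcount and the calling convention are made up for illustration and are not SICStus internals:

#include <cstdint>

// Hypothetical runtime-system helper, not an actual SICStus function.
extern "C" std::uint64_t prolog_popcount(std::uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    // On x86-64 with -mpopcnt this typically compiles to a single POPCNT.
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#else
    // Portable fallback (classic bit-twiddling from Hacker's Delight).
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
#endif
}

Step 3 would then amount to the JIT emitting the POPCNT instruction directly instead of a call to such a routine, on targets where it is available.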
One of the most famous Stack Overflow questions is why processing a sorted array is so much faster than processing an unsorted one, and the answer is branch prediction.
Will the application of Intel's and Microsoft's Spectre fixes effectively nullify the answer given in that question on the affected processors (older-generation Intel processors, AMD Ryzen, and ARM)?
No, the key to Spectre is forcing mis-prediction of indirect branches, because they can jump to any address. It's non-trivial to find a sequence of instructions that loads secret data you want, and then makes another data-dependent load with the secret as an array index.
To attack a regular taken / not-taken conditional branch (like you'd find in a sort function, or the conditional in the loop over a sorted or unsorted array), you'd need to find a case where executing the "wrong" side of a branch (maybe the wrong side of an if/else in the source) would do something useful when it runs with the "wrong" values in registers. It's plausible [1], but unlikely, so most defenses against Spectre will only worry about indirect branches.
Hardware fixes for Spectre will have to be more subtle than "turn off branch prediction" (i.e. stall the pipeline at every conditional branch). That would probably reduce performance by an order of magnitude in a lot of code, which is far too high a cost to be an acceptable defense against a local information leak (which can lead to privilege escalation).
Even turning off prediction for only indirect branches (but not regular conditional branches) may be too expensive for most user-space code, because every shared library / DLL function call goes through an indirect branch in the normal software ecosystem on mainstream OSes (Linux, OS X, Windows).
The Linux kernel is experimenting with a retpoline to defeat indirect-branch prediction for indirect branches inside the kernel. I'm not sure it's enabled by default, though, even in kernels that enable the Meltdown workaround (KPTI).
Footnotes:
1. Sometimes the wrong case of a switch could do something totally inappropriate (e.g. in an interpreter), and if the switch was compiled with nested branches rather than a single indirect branch then you might be able to attack it. (Compilers often use a table of branch targets for a switch, but when the cases are sparse it's not always possible; e.g. case 10 / case 100 / case 1000 / default would need a 990-entry table with only 3 used values.)
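As a small sketch of what such a sparse switch looks like (my own example; compilers are free to lower it either way, but with cases this sparse they will typically emit a chain or tree of conditional compare-and-branch instructions rather than a jump table):

// Sparse cases: a jump table would be mostly empty, so the compiler
// usually generates nested conditional branches instead of one indirect jump.
int classify(int code) {
    switch (code) {
        case 10:   return 1;
        case 100:  return 2;
        case 1000: return 3;
        default:   return 0;
    }
}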
C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
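For reference, a minimal sketch of calling such a parallel algorithm (my own example; note that with libstdc++ the parallel execution policies may additionally require linking against a backend such as TBB):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1000);
    std::iota(v.rbegin(), v.rend(), 0);  // fill v in descending order: 999, 998, ..., 0
    // Requests a parallel and vectorised sort; an implementation may fall
    // back to sequential execution if it cannot honour the policy.
    std::sort(std::execution::par_unseq, v.begin(), v.end());
}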
I noticed that Microsoft's experimental implementation mentions here that the VC++ compiler lacks support for vectorization, which surprises me: I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even when explicitly told to do so. This seeming lack of automatic vectorization support contradicts the answers to this 2011 question on Quora, which suggest that compilers will do vectorization where possible.
Maybe compilers only vectorize very obvious cases, such as operating on a std::array<int, 4>, and no more than that, in which case C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do better automatic vectorization (perhaps due to language design), which would explain why the C++ standards committee saw a need for explicit (C++17-style) vectorization?
The best compiler for automatically spotting SIMD style vectorisation (when told it can generate opcodes for the appropriate instruction sets of course) is the Intel compiler in my experience (which can generate code to do dynamic dispatch depending on the actual CPU if required), closely followed by GCC and Clang, and MSVC last (of your four).
This is perhaps unsurprising, I realise: Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel, and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also quite rightly point out that their compiler lets you use pragma simd constructs to show the compiler which assumptions can or can't be made (assumptions that are unclear at a purely syntactic level), and hence allows it to vectorise the code further without resorting to intrinsics.
This, I think, points at the problem with hoping that the compiler (for C++ or any other language) will do all the vectorisation work... if you have simple vector-processing loops (e.g. multiplying all the elements in a vector by a scalar) then yes, you could expect 3 of the 4 compilers to spot that.
But for more complicated code, the vectorisation gains come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard, if not impossible, for a compiler to do completely on its own. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to let the compiler see the opportunity to do so, perhaps with pragma simd constructs or OpenMP (see the sketch below), then you may get the results you want.
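A minimal sketch of both ideas, assuming GCC/Clang/ICC-style __restrict and OpenMP SIMD support (compile the second function with -fopenmp or -fopenmp-simd); the function names are my own:

// A loop simple enough that GCC, Clang and ICC typically auto-vectorise it
// at -O2/-O3, provided the pointers are known not to alias (__restrict).
void scale(float* __restrict dst, const float* __restrict src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

// Telling the compiler the loop is safe to vectorise even when it cannot
// prove the absence of aliasing or dependences by itself.
void scale_simd(float* dst, const float* src, float k, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}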
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this: put your C++ code in there and look at what the different compilers actually generate? Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later) but it will show you what different versions of ICC, GCC, Clang and others can do with your code...
Our compilers course features exercises asking us to compare code built with the -O and -O3 gcc options. The code generated on my machine isn't the same as the code in the course. Is there a way to figure out the optimization options used in the course, in order to obtain the same code on my machine and make more meaningful observations?
I found out how to list the optimization options on my machine:
$ gcc -O3 -Q --help=optimizer
But is there a way to deduce the options used on my professor's machine, other than trying them all and modifying them one by one? (The generated code contains .ident "GCC: (Debian 4.3.2-1.1) 4.3.2".)
Thanks for your attention.
Edit:
I noticed that the code generated on my machine lacks the prologue and epilogue generated on my professor's. Is there an option to force prologue generation (Google doesn't seem to turn up much)?
Here's what you need to know about compiler optimizations: they are architecture-dependent. They also differ from one version of the compiler to another (gcc 4.9 does more by default than gcc 4.4).
By architecture, I mean the CPU microarchitecture (Intel: Nehalem, Sandy Bridge, Ivy Bridge, Haswell, KNC, ...; AMD: Bobcat, Bulldozer, Jaguar, ...). Compilers usually convert the input code (C, C++, Ada, ...) into a CPU-agnostic intermediate representation (GIMPLE for GCC) on which a large number of optimizations are performed. After that, the compiler generates a lower-level representation closer to assembly, on which architecture-specific optimizations are applied. Such optimizations include choosing the instructions with the lowest latencies, determining loop unroll factors depending on the loop size and the instruction cache size, and so on.
Since your generated code is different from the one you got in class, I suppose the underlying architectures are different. In that case, even with the same compiler flags you won't be able to get the same assembly code (even with no optimizations you'll get different assembly).
Instead, you should concentrate on comparing the optimized and non-optimized code rather than trying to match exactly what you were given in class. In fact, I think it's a great reverse-engineering exercise to compare your optimized code to the one you were given.
You can find one of my earlier posts about compiler optimizations here.
Two great books on the subject are the Dragon Book (Compilers: Principles, Techniques, and Tools) by Aho, Sethi, and Ullman, and Engineering a Compiler by Keith Cooper and Linda Torczon.
Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool for which you can define an assembly language syntax and then build custom analyzers. Your analyzers might be simple ("how much space does an instruction take?") or complex ("how many cycles will this instruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One tool designed specifically to do that is the New Jersey Machine-Code Toolkit. It is really designed for building code generators and debuggers. I suspect it would be good at "instruction byte count"; it isn't clear it is good at more sophisticated analyses, and I believe it insists you follow its syntax style rather than yours.
One not designed specifically to do that, but good at parsing and analyzing languages in general, is our DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context-free language (which covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done this with several assembly languages, including the IBM 370, Motorola's 8-bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (a computation over an AST) to DMS easily. Attribute grammars are a great way to encode analyses that need only local information, such as "How big is this instruction?". For more complex analyses, you'll need a processor model that is driven by a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to compute more complex facts such as "How long does this instruction take?".
Other analyses, such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("the control-flow successor of this instruction is...", "data from this instruction flows to...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?").
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you could likely code all this in Python; after all, it's Turing-complete. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code based on your analyses, so you could implement your optimizer with it, too. After all, you need to connect the analysis indicating that an optimization is safe to the actual optimization steps.
I have written many disassemblers, including ARM and Thumb; not production quality, but written for the purpose of learning the instruction sets. For both ARM and Thumb, the ARM ARM (ARM Architecture Reference Manual) has a nice chart from which you can easily separate data operations from loads/stores, etc.; maybe an hour's worth of work, maybe two. At least up front, you would end up with data values being counted, though.
The other poster may be right: with the chart I am talking about, it should be very simple to write a program that examines the ASCII assembly looking for ldr, str, add, etc. There is no need to parse everything if you are only interested in memory operation counts and the like. Of course, the downside is that you are likely not going to be able to account for loops: one function may have a load and a store, another may have a load and a store wrapped in a loop, causing many more memory operations once executed.
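Here is a rough sketch of that kind of line scan, written in C++ for illustration (the same approach works in Python); it assumes GNU-style ARM/Thumb assembly where the mnemonic is the first token on an instruction line, and it makes no attempt to account for loops:

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: asmcount file.s\n"; return 1; }
    std::ifstream in(argv[1]);
    std::map<std::string, long> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string tok;
        if (!(ss >> tok)) continue;                    // blank line
        if (tok[0] == '.' || tok[0] == '@') continue;  // directive or comment
        if (tok.back() == ':') continue;               // label
        // Crude classification by mnemonic prefix (ldr*/str* cover the
        // common ARM/Thumb load/store forms such as ldrb, strh, ldrd, ...).
        if (tok.compare(0, 3, "ldr") == 0 || tok.compare(0, 3, "str") == 0)
            ++counts["memory"];
        else
            ++counts["other"];
    }
    for (const auto& kv : counts)
        std::cout << kv.first << ": " << kv.second << "\n";
}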
Not knowing what you are really interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a Thumb simulator (thumbulator) that attempts to do just that (and I have used it to compare llvm-generated code against gcc-generated code in terms of the number of instructions executed, fetches, memory operations, etc.). The problem may be that it is Thumb only: no ARM, no Thumb-2. Thumb-2 could be added more easily than ARM. There is also an ARMulator from ARM, which is in the gdb sources among other places; I can't remember whether it executes Thumb-2. My understanding is that when ARM was maintaining it, it would accurately report these sorts of statistics.
You can plug your statistics collection into the LLVM code generator; it's quite flexible, and it already collects some statistics which could be used as an example.