I am porting the code from INTEL architecture to ARM architecture. The same code I am trying to build with arm cross compilers on centos. Are there any compiler options that increase the performance of executable image because the image on INTEL might not give similar performance on ARM. Is there a way to achieve this?
A lot of optimization options exist in GCC. By default the compiler tries to make the compilation process as short as possible and produce an object code that makes debugging easy. However, gcc also provides several options for optimization.
Four general levels of incremental optimizations for performance are available in GCC and can be activated by passing one of the options -O0, -O1, -O2 or -O3. Each of these levels activates a set of optimizations that can be also activated manually by specifying the corresponding command line option. For instance with -O1 the compiler performs branches using the decrement and branch instruction (if available and applicable) rather then decrementing the register, comparing it to zero and branching in separate instructions. This decrement and branch can be also specified manually by passing the -fbranch-count-reg option. Consider that optimizations performed at each level also depend on the target architecture, you can get the list of available and enabled optimization by running GCC with the -Q --help=optimizers option.
Generally speaking the levels of optimization correspond to (notice that at each level also the optimization of the previous ones are applied):
-O0: the default level, the compiler tries to reduce compilation time and produce an object code that can be easily processed by a debugger
-O1: the compiler tries to reduce both code size and execution time. Only optimizations that do not take a lot of compile time are performed. Compilation may take considerably more memory.
-O2: the compiler applies almost all optimizations available that do not affect the code size. Compiling takes more but performance should improve.
-O3: the compiler applies also optimization that might increase the code size (for instance inlining functions)
For a detailed description of all optimization options you can have a look here.
As a general remark consider that compiler optimization are designed to work in the general case but their effectiveness will depend a lot on both you program and the architecture you are running it on.
Edit:
If you are interested in memory paging optimization there is the -freorder-blocks-and-partition option (activated also with -O2). This option reorders basic blocks inside each function to partition them in hot blocks (called frequently) and cold blocks (called rarely). Hot blocks are then placed in contiguous memory locations. This should increasing cache locality and paging performance.
Related
I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.
One thing which isn't clear to me is that my current understanding is that CPU optimizations and compiler optimizations are orthogonal. I.e. can occur independently of each other.
However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.
Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:
while (my_variable != what_i_want) {}
and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:
while (my_variable != what_i_want)
cpu_relax();
It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).
I have several gaps here:
1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?
2) Is the same true for other instructions such as smb_mb() and the likes?
3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?
4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?
Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.
Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.
Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.
The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.
There are a number subtle questions related to cpu and smp concurrency in your questions which will require you to look at the kernel code. Here are some quick ideas to get you started on the research specifically for the x86 architecture.
The idea is that you are trying to perform a concurrency operation where your kernel task (see kernel source sched.h for struct task_struct) is in a tight loop comparing my_variable with a local variable until it is changed by another kernel task (or change asynchronously by a hardware device!) This is a common pattern in the kernel.
The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to more efficiently run a spinlock so that the lock variable update is more readily visible by the spinning CPU. GCC will execute the function/macro just like any other function. If cpu_relax is removed from the loop then gcc CAN consider the loop as non-functional and remove it. Look at the Intel X86 Software Manuals for the PAUSE instruction.
smp_mb is an x86 memory fence instruction that flushes the memory cache. One CPU can change my_variable in its cache but it will not be visible to other CPUs. smp_mb provides on-demand cache coherency. Look at the Intel X86 Software Manuals for MFENCE/LFENCE instructions.
Note that smp_mb() flushes the CPU cache so it CAN be an expensive operation. Current Intel CPUs have huge caches (~6MB).
If you expand cpu_relax on an x86, it will show asm volatile("rep; nop" ::: "memory"). This is NOT a compiler barrier but code that GCC will not optimize out. See the barrier macro, which is asm volatile("": : : "memory") for the GCC hint.
I'm not clear what you mean by "scope of cpu_relax". Some possible ideas: It's the PAUSE machine instruction, similar to ADD or MOV. PAUSE will affect only the current CPU. PAUSE allows for more efficient cache coherency between CPUs.
I just looked at the PAUSE instruction a little more - an additional property is it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means but I suppose it could briefly indicate a false value in a variable? Still a lot of questions....
The Eigen web site says:
Explicit vectorization is performed for SSE 2/3/4, AVX, FMA, AVX512, ARM NEON (32-bit and 64-bit), PowerPC AltiVec/VSX (32-bit and 64-bit) instruction sets, and now S390x SIMD (ZVector) with graceful fallback to non-vectorized code.
Does that mean if you compile, e.g. with FMA, and the CPU you are running on does not support it it will fall back to completely unvectorised code? Or will it fall back to the best vectorisation available?
If not, is there any way to have Eigen compile for all or several SIMD ISAs and automatically pick the best at runtime?
Edit: To be clear I'm talking about run-time fallbacks.
There is absolutely no run-time dispatching in Eigen. Everything that happens happens at compile-time. This is where you need to make all of the choices, either via preprocessor macros that control the library's behavior, or using optimization settings in your C++ compiler.
In order to implement run-time dispatching, you would either need to check and see what CPU features were supported on each and every call into the library and branch into the applicable implementation, or you would need to do this check once at launch and set up a table of function pointers (or some other method to facilitate dynamic dispatching). Eigen can't do the latter because it is a header-only library (just a collection of types and functions), with no "main" function that gets called upon initialization where all of this setup code could be localized. So the only option would be the former, which would result in a significant performance penalty. The whole point of this library is speed; introducing this type of performance penalty to each of the library's functions would be a disaster.
The documentation also contains a breakdown of how Eigen works, using a simple example. This page says:
The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).
which lends further credence to the claim that static, compile-time options determine how the library will work.
Whichever instruction set you choose to target, the generated code will have those instructions in it. If you try to execute that code on a processor that does not support these instructions (for example, you compile Eigen with AVX optimizations enabled, but you run it on an Intel Nehalem processor that doesn't support the AVX instruction set), then you will get an invalid instruction exception (presented to your program in whatever form the operating system passes CPU exceptions through). This will happen as soon as your CPU encounters an unrecognized/unsupported instruction, which will probably be deep in the bowels of one of the Eigen functions (i.e., not immediately upon startup).
However, as I said in the comments, there is a fallback mechanism of sorts, but it is a static one, all done at compile time. As the documentation indicates, Eigen supports multiple vector instruction sets. If you choose a lowest-common-denominator instruction set, like SSE2, you will still get some degree of vectorization. For example, although SSE3 may provide a specialized instruction for some particular task, if you can't target SSE3, all hope is not lost. There is, in many places, code that uses a series of SSE2 instructions to accomplish the same task. These are probably not as efficient as if SSE3 were available, but they'll still be faster than having no vector code whatsoever. You can see examples of this by digging into the source code, specifically the arch folder that contains the tidbits that have been specifically optimized for the different instruction sets (often by the use of intrinsics). In other words, you get the best code it can give you for your target architecture/instruction set.
I have written a library, where I use CMake for verifying the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition to this, I check for the presence of the instructions and if present, I add the necessary compiler flags, -msse2 -mavx -mfma etc.
This is all very good, but I would like to deploy a single binary, which works across a range of generations of processors.
Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures? And of of course introduce high-level branches
I am thinking similar to how the compiler generates code for functions, where input pointers are either 4 or 8 byte aligned. To prevent this, I use the __builtin_assume_aligned macro.
What is best practice? Multiple binaries? Naming?
As long as you don't care about portability, yes.
Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
If you're talking about just getting the compiler to generate SSE/AVX etc instructions, and you've got "general purpose" code (ie you're not explicitly vectorising using intrinsics, or got lots of code that the compiler will spot and auto-vectorise) then I should warn you that AVX, AVX2 or AVX512 compiling your entire codebase will probably run significantly slower than compiling for SSE versions.
When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high power and low-power opcodes, so you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing in order to make up for this performance deficit before you start seeing any gains.
See my longer explanation and references in this thread.
If on the other hand you're explicitly vectorising using intrinsics and you're sure you have large enough burst of AVX etc to make it worthwhile, I've successfully written code where I tell MSVC to compile for SSE2 (default for x64) but then I dynamically check the CPU capabilities and some functions switch to a codepath implemented using AVX intrinsics.
MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: #nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.
Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code) - it can be done but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
I tried to scrub the GCC man page for this, but still don't get it, really.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Is it ever possible to just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
This is what i've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all features of X. GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X, has a good chance to run on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, i guess?) may be specific to a particular CPU vendor, making use of these will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to -march=core2 and -mtune=i686 - your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize code for much newer haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is here to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you ever going to run only on the same machine you build on.
Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything else than i386, you will have to write the code using the specific instruction sets in inline assembler. I don't know if there are higher-language toolkits for this. Determining the CPU features is straight forward, but could also need to be done in assembler.
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded, because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend on not just function name, but also processor features. Another method uses the fact that the name to address function uses a table of pointers in an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions; so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed) Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate. This is because speed ups from micro optimization would be decidedly low compared to speed ups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL code that includes additional versions of code that are tuned for different architectures when most of this code is rarely used / are never part of the critical code that consumes most of your CPU cycles.