Vectorization flags with Eigen and IPOPT

I have some C++ functions that I am optimizing with IPOPT. Although the cost function, constraint functions, etc. are written in C++, the code was originally written to use the C interface. I haven't bothered to change that, and won't unless it turns out to be the issue.
Anyway... We are observing some unexpected behavior where the optimizer converges differently when we compile the program with/without vectorization flags. Specifically, in the CMakeLists file, we have
set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")
When we run the optimizer with these settings, it converges in approximately 100 iterations. So far, so good.
However, we have reason to believe that when compiled for ARM (Android specifically), no vectorization is occurring, because the performance is drastically different from what we see on an Intel processor. The Eigen documentation says that NEON instructions should always be enabled for 64-bit ARM, but we have reason to suspect that this is not happening. Anyway, that is not the question here.
Due to this suspicion, we wanted to see how bad the performance would be on our Intel processor if we disabled vectorization. This should give us some indication of how much vectorization is occurring, and how much improvement we might expect to see in ARM. However, when we change the compiler flags to
set(CMAKE_CXX_FLAGS "-Wall")
(or to the case where we use just AVX, without FMA), then we get the same general solution from the optimizer, but with very different convergence performance. Specifically, without vectorization, the optimizer takes about 500 iterations to converge to the solution.
So in summary:
With AVX and FMA : 100 iterations to converge
With AVX : 200 iterations to converge
Without AVX and FMA : 500 iterations to converge
We are literally only changing that one line in the cmake file, not the source code.
I would like some suggestions for why this may be occurring.
My thoughts and more background info:
It seems to me that the versions with and without vectorization must be rounding differently somewhere, and that is making IPOPT converge differently. I was under the impression that adding the AVX and FMA flags would not change the output of the functions, only the time it takes to compute them. I appear to be wrong.
The phenomenon appears particularly strange to me because, on one hand, the optimizer always converges to the same solution, which suggests that the problem can't be too ill-conditioned. On the other hand, the fact that the optimizer behaves differently with and without vectorization flags suggests that the problem IS sensitive to whatever small rounding differences the vectorized instructions introduce.
One other thing to keep in mind is that we precompiled IPOPT into a library and are simply linking our code against that precompiled library, so I don't think the AVX and FMA flags can be affecting the optimizer itself. That seems to mean that our functions must be outputting tangibly different values depending on whether vectorization is enabled.
For those interested, here is the full cmake file
cmake_minimum_required(VERSION 3.5)
# If a build type is not passed to cmake, then use this...
if(NOT CMAKE_BUILD_TYPE)
    # set(CMAKE_BUILD_TYPE Release)
    set(CMAKE_BUILD_TYPE Debug)
endif()
# If you are debugging, generate symbols.
set(CMAKE_CXX_FLAGS_DEBUG "-g")
# If in release mode, use all possible optimizations
set(CMAKE_CXX_FLAGS_RELEASE "-O3")
# We need c++11
set(CMAKE_CXX_STANDARD 11)
# Show us all of the warnings and enable all vectorization options!!!
# I must be crazy because these vectorization flags seem to have no effect.
set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")
if (CMAKE_SYSTEM_NAME MATCHES "CYGWIN")
    include_directories(../../Eigen/
        /cygdrive/c/coin/windows/ipopt/include/coin/
        /cygdrive/c/coin/windows/ipopt/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS /cygdrive/c/coin/windows/ipopt/lib/)
else ()
    include_directories(../../Eigen/
        ../../coin/CoinIpopt/build/include/coin/
        ../../coin/CoinIpopt/build/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS ../../coin/CoinIpopt/build/lib/)
endif ()
# Build the c++ functions into an executable
add_executable(trajectory_optimization main.cpp)
# Link all of the libraries together so that the C++-executable can call IPOPT
target_link_libraries(trajectory_optimization ${IPOPT_LIBRARY})

Enabling FMA results in different rounding behavior, which can lead to very different results if your algorithm is not numerically stable. Also, enabling AVX in Eigen results in a different order of additions, and since floating-point math is non-associative, this can also lead to slightly different behavior.
To illustrate why non-associativity can make a difference: when adding 8 consecutive doubles a[8] with SSE3 or with AVX, Eigen will typically produce code equivalent to the following:
// SSE:
double t[2] = {a[0], a[1]};
for (int i = 2; i < 8; i += 2)
    t[0] += a[i], t[1] += a[i+1]; // addpd
t[0] += t[1];                     // haddpd

// AVX:
double t[4] = {a[0], a[1], a[2], a[3]};
for (int j = 0; j < 4; ++j) t[j] += a[4+j]; // vaddpd
t[0] += t[2]; t[1] += t[3];                 // vhaddpd
t[0] += t[1];                               // vhaddpd
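To isolate the FMA effect mentioned above, here is a minimal sketch of my own (not the asker's code); compile it with and without -mfma, and add -ffp-contract=off to keep GCC from contracting the non-fused expression on its own:

#include <cmath>
#include <cstdio>
int main() {
    // eps = DBL_EPSILON, chosen so that a*b = 1 - eps*eps rounds to exactly 1.0
    double eps = std::ldexp(1.0, -52);
    double a = 1.0 + eps, b = 1.0 - eps, c = -1.0;
    double separate = a * b + c;          // product rounded first, result 0
    double fused = std::fma(a, b, c);     // rounded once, result -eps*eps
    std::printf("separate=%g fused=%g\n", separate, fused);
}

If your objective or gradient values differ at that level between builds, a different IPOPT iteration count is entirely plausible.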
Without more details it is hard to tell what exactly happens in your case.

Related

Does GCC's ffast-math have consistency guarantees across platforms or compiler versions?

I want to write cross-platform C/C++ which has reproducible behaviour across different environments.
I understand that gcc's ffast-math enables various floating-point approximations. This is fine, but I need two separately-compiled binaries to produce the same results.
Say I use gcc always, but variously for Windows, Linux, or whatever, and different compiler versions.
Is there any guarantee that these compilations will yield the same set of floating-point approximations for the same source code?
No, it's not that they allow specific approximations, it's that -ffast-math allows compilers to assume that FP math is associative when it's not. i.e. ignore rounding error when transforming code to allow more efficient asm.
Any minor differences in choice of order of operations can affect the result by introducing different rounding.
Older compiler versions might choose to implement sqrt(x) as x * approx_rsqrt(x) plus a Newton-Raphson iteration for -ffast-math, because older CPUs had a slower sqrtps instruction, so it was more often worth replacing it with an approximation of the reciprocal sqrt plus 3 or 4 more multiply and add instructions. That is generally no longer worthwhile on recent CPUs, so even if you use the same tuning options (especially the default -mtune=generic instead of, say, -mtune=haswell), the choices -ffast-math makes can change between GCC versions.
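The transformation being described is roughly the following (a hand-written sketch with SSE intrinsics, not GCC's actual output; real compiler-generated code also has to special-case x == 0, which here would produce 0 * inf = NaN):

#include <xmmintrin.h>
// sqrt(x) ~= x * rsqrt(x), with one Newton-Raphson step to refine the
// ~12-bit estimate from rsqrtps: r1 = 0.5 * r0 * (3 - x * r0 * r0)
__m128 fast_sqrt_ps(__m128 x) {
    __m128 r = _mm_rsqrt_ps(x);
    __m128 half = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    r = _mm_mul_ps(_mm_mul_ps(half, r),
                   _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(r, r))));
    return _mm_mul_ps(x, r);
}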
It's hard enough to get deterministic FP without -ffast-math; different libraries on different OSes have different implementations of functions like sin and log (which unlike the basic ops + - * / sqrt are not required to return a "correctly rounded" result, i.e. max error 0.5ulp).
And extra precision for temporaries (FLT_EVAL_METHOD) can change the results if you compile for 32-bit x86 with x87 FP math. (-mfpmath=387 is the default for -m32). If you want to have any hope here, you'll want to avoid 32-bit x86. Or if you're stuck with it, maybe you can get away with -msse2 -mfpmath=sse...
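A small illustration of that point (my own example; whether the intermediate really stays in an 80-bit register depends on the optimization level and how GCC spills temporaries):

#include <cstdio>
int main() {
    volatile double big = 1e308;     // volatile so the expression isn't folded
    double r = big * 10.0 / 10.0;    // the product exceeds the double range
    // -m32 -mfpmath=387: the product may be kept in an 80-bit x87 register,
    // giving 1e308; -mfpmath=sse: it overflows to inf and stays inf.
    std::printf("%g\n", r);
}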
You mentioned Windows, so I'm assuming you're only talking about x86 GNU/Linux, even though Linux runs on many other ISAs.
But even just within x86, compiling with -march=haswell enables use of FMA instructions, and GCC defaults to #pragma STDC FP_CONTRACT ON (even across C statements, beyond what the usual ISO C rules allow.) So actually even without -ffast-math, FMA availability can remove rounding for the x*y temporary in x*y + z.
With -ffast-math:
One version of gcc might decide to unroll a loop by 2 (and use 2 separate accumulators) when summing an array, while an older version of gcc with the same options might still sum in order.
(Actually current gcc is terrible at this: when it does unroll (not by default), it often still uses the same (vector) accumulator, so it doesn't hide FP latency the way clang does. E.g. https://godbolt.org/z/X6DTxK uses different registers for the same variable, but it's still just one accumulator, with no vertical addition after the sum loop. But hopefully future gcc versions will be better. And differences between gcc versions in how they do a horizontal sum of a YMM or XMM register could introduce differences there when auto-vectorizing.)
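To make the accumulator-order point concrete, a small sketch of my own with hand-picked values:

#include <cstdio>
int main() {
    const double a[4] = {1e16, 1.0, -1e16, 1.0};

    double in_order = 0.0;              // sequential sum in program order
    for (int i = 0; i < 4; ++i)
        in_order += a[i];               // 1e16 + 1.0 rounds back to 1e16

    double acc0 = 0.0, acc1 = 0.0;      // two accumulators, as an unrolled or
    for (int i = 0; i < 4; i += 2) {    // vectorized sum might use
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    double split = acc0 + acc1;

    std::printf("in order: %g, split: %g\n", in_order, split);  // 1 vs 2
}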

GCC Auto Vectorization

Is there a way in the gcc compiler to enable auto-vectorization only? I know that the -ftree-vectorize flag enables auto-vectorization, but it requires at least the -O2 optimization level. Is there a way to enable auto-vectorization without using the -O2 optimization flag?
Thanks in advance.
You could actually get decent auto vectorization with -ftree-vectorize combined with -O1, for example: Godbolt.
With -O0, however, vectorized code won't be generated, even for very simple examples.
I suspect that gcc's tree vectorizer isn't even called with -O0, or is called and bails out, but that would have to be verified in the gcc source code.
Generally, -O0 and auto vectorization don't mix very well. In compilers, optimizations happen in phases, where each optimization phase prepares the ground for the next one.
For auto vectorization to occur, at least on non-trivial examples, the compiler has to perform some optimizations beforehand. For example, loops that contain jumps usually cannot be vectorized unless the branches are eliminated and replaced with predicated instructions by an optimization called if-conversion, resulting in a flat block of code that can be vectorized more conveniently.
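For example, a loop like the following needs if-conversion before it can be vectorized (my sketch; whether a given GCC version actually vectorizes it at -O1 -ftree-vectorize is worth checking on Godbolt):

#include <cstddef>
void clamp_negatives(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = in[i];
        if (v < 0.0f)       // branch in the loop body...
            v = 0.0f;       // ...if-converted into a conditional select
        out[i] = v;         // unconditional store, so vectorization is easy
    }
}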
Footnote - I came across this nice presentation about GCC auto vectorization, which you may find interesting - it gives a good introduction to auto vectorization with gcc, compiler flags and basic concepts.

GCC optimization levels. Which is better?

I am focusing on the CPU/memory consumption of programs compiled by GCC.
Is code compiled with -O3 always so greedy in terms of resources when executed?
Is there any scientific reference or specification that shows the difference of Mem/cpu consumption of different levels?
People working on this problem often focus on the impact of these optimizations on execution time, compiled code size, and energy. However, I can't find much work discussing the effect of enabling optimizations on resource consumption.
Thanks in advance.
No, there is no absolute way, because optimization in compilers is an art (it is not even well defined, and might be undecidable or intractable).
But some guidelines first:
be sure that your program is correct and has no bugs before optimizing anything, so do debug and test your program
have well designed test cases and representative benchmarks (see this).
be sure that your program has no undefined behavior (and this is tricky, see this), since GCC will optimize strangely (but very often correctly, according to C99 or C11 standards) if you have UB in your code; use the -fsanitize=style options (and gdb and valgrind ....) during debugging phase.
profile your code (on various benchmarks), in particular to find out what parts are worth optimization efforts; often (but not always) most of the CPU time happens in a small fraction of the code (rule of thumb: 80% of time spent in 20% of code; on some applications like the gcc compiler this is not true, check with gcc -ftime-report to ask gcc to show time spent in various compiler modules).... Most of the time "premature optimization is the root of all evil" (but there are exceptions to this aphorism).
improve your source code (e.g. use restrict and const carefully and correctly, add some pragmas or function or variable attributes, perhaps use wisely some GCC builtins such as __builtin_expect, __builtin_prefetch -see this-, __builtin_unreachable...; a short sketch follows this list)
use a recent compiler. The current version (October 2015) of GCC is 5.2 (and GCC 8 in June 2018), and continuous progress on optimization is being made; you might consider compiling GCC from its source code to have a recent version.
enable all warnings (gcc -Wall -Wextra) in the compiler, and try hard to avoid all of them; some warnings may appear only when you ask for optimization (e.g. with -O2)
Usually, compile with -O2 -march=native (or perhaps -mtune=native; I assume that you are not cross-compiling, and if you are, add the appropriate -march option...) and benchmark your program with that
Consider link-time optimization by compiling and linking with -flto and the same optimization flags. E.g., put CC= gcc -flto -O2 -march=native in your Makefile (then remove -O2 -march=native from your CFLAGS there)...
Also try -O3 -march=native; usually (but not always: you might sometimes get slightly faster code with -O2 than with -O3, but this is uncommon) you might get a tiny improvement over -O2
If you want to optimize the generated program size, use -Os instead of -O2 or -O3; more generally, don't forget to read the section Options That Control Optimization of the documentation. I guess that both -O2 and -Os would optimize the stack usage (which is very related to memory consumption). And some GCC optimizations are able to avoid malloc (which is related to heap memory consumption).
you might consider profile-guided optimizations, -fprofile-generate, -fprofile-use, -fauto-profile options
dive into the documentation of GCC, it has numerous optimization & code generation arguments (e.g. -ffast-math, -Ofast ...) and parameters and you could spend months trying some more of them; beware that some of them are not strictly C standard conforming!
recent GCC and Clang can emit DWARF debug information (somehow "approximate" if strong optimizations have been applied) even when optimizing, so passing both -O2 and -g could be worthwhile (you still would be able, with some pain, to use the gdb debugger on optimized executable)
if you have a lot of time to spend (weeks or months), you might customize GCC using MELT (or some other plugin) to add your own new (application-specific) optimization passes; but this is difficult (you'll need to understand GCC internal representations and organization) and probably rarely worthwhile, except in very specific cases (those when you can justify spending months of your time for improving optimization)
you might want to understand the stack usage of your program, so use -fstack-usage
you might want to understand the emitted assembler code, use -S -fverbose-asm in addition of optimization flags (and look into the produced .s assembler file)
you might want to understand the internal working of GCC, use various -fdump-* flags (you'll get hundred of dump files!).
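As a small illustration of the source-level hints mentioned in the list above (my own sketch; the function is hypothetical, and __restrict__ / __builtin_expect are GCC/Clang extensions, not standard C++):

#include <cstddef>
void axpy(float* __restrict__ y, const float* __restrict__ x,
          float a, std::size_t n) {
    if (__builtin_expect(n == 0, 0))   // tell GCC the empty case is unlikely
        return;
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];              // no aliasing, so the loop can vectorize
}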
Of course the above todo list should be used in an iterative and agile fashion.
For memory leaks bugs, consider valgrind and several -fsanitize= debugging options. Read also about garbage collection (and the GC handbook), notably Boehm's conservative garbage collector, and about compile-time garbage collection techniques.
Read about the MILEPOST project in GCC.
Consider also OpenMP, OpenCL, MPI, multi-threading, etc... Notice that parallelization is a difficult art.
Notice that even GCC developers are often unable to predict the effect (on CPU time of the produced binary) of such and such optimization. Somehow optimization is a black art.
Perhaps the gcc-help@gcc.gnu.org mailing list might be a good place to ask more specific, precise, and focused questions about optimizations in GCC
You could also contact me on basileatstarynkevitchdotnet with a more focused question... (and mention the URL of your original question)
For scientific papers on optimizations, you'll find lots of them. Start with ACM TOPLAS, ACM TACO etc... Search for iterative compiler optimization etc.... And define better what resources you want to optimize for (memory consumption means next to nothing....).

Testing FPU on arm processor

I am using a Wandboard-Quad that contains an i.MX6 ARM processor. This processor has an FPU that I would like to utilize. Before I do, I want to test how much improvement I will get. I have a benchmark algorithm and have tried it with no optimization and with -mfpu=vfp, and there appears to be no improvement -- I do get improvement with -O3.
I am using arm-linux-gnueabi libraries -- Any thoughts on what is incorrect and how I can tell if I am using the FPU?
Thanks,
Adam
Look at the assembler output with a -S flag and see if there are any fpu instructions being generated. That's probably the easiest thing.
Beyond that, there is a chance that your algorithm uses floating point so rarely that any gain would be masked by loading and unloading the FPU registers. In that case, -O3 optimizations in the other parts of your code would show you gains separate from the FPU usage.
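A tiny test file makes that check easy (my own sketch; the exact driver name and flags depend on your toolchain):

// fputest.cpp -- compile with e.g.
//   arm-linux-gnueabi-g++ -O2 -mfloat-abi=softfp -mfpu=vfp -S fputest.cpp
// then look in fputest.s: hardware FP shows up as VFP instructions such as
// vmul.f32 / vadd.f32, while soft float calls helpers like __aeabi_fmul.
float dot(const float* x, const float* y, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}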
-mfpu option works only when GCC is performing vectorization. Vectorization itself requires reasonable optimization level (minimum is -O2 with -ftree-vectorize option on). So try -O3 -ftree-vectorize -mfpu=vfp to utilize FPU and measure difference against simple -O3 level.
Also see ARM GCC docs for cases where -funsafe-math-optimizations may be required.
Without any optimisation the output from GCC is so inefficient that you might actually not be able to measure the difference between software and hardware floating point.
To see the benefits that the FPU adds, you need to test with a consistent optimisation level, and then use either -msoft-float or -mhard-float.
This will force the compiler to link against different libraries and make function calls for the floating-point operations rather than using native instructions. It is still possible that the underlying library uses hardware floating point, but I wouldn't worry about that too much.
You can select different sets of FP instructions using -mfpu=. For i.MX6 I think you want -mfpu=neon, as that should enable all applicable floating-point instructions (not just the NEON ones).

Compile time comparison between Windows GCC and MSVC compiler

We are working on reducing compile times on Windows and are therefore considering all options. I've tried to look on Google for a comparison between compile time using GCC (MinGW or Cygwin) and the MSVC compiler (CL) without any luck. Of course, making a comparison would not be too hard, but I'd rather avoid reinventing the wheel if I can.
Does anyone know of such an comparison out there? Or maybe anyone has some hands-on-experience?
Input much appreciated :)
Comparing compilers is not trivial:
It may vary from processor to processor. GCC may better optimize for i7 and MSVC for Core 2 Duo or vice versa. Performance may be affected by cache etc. (Unroll loops or don't unroll loops, that is the question ;) ).
It depends very largely on how code is written. Certain idioms (equivalent to each other) may be preferred by one compiler.
It depends on how the code is used.
It depends on flags. For example gcc -O3 is known to often produce slower code than -O2 or -Os.
It depends on what assumptions can be made about the code. Can you allow strict aliasing or not (-fno-strict-aliasing/-fstrict-aliasing in gcc)? Do you need full IEEE 754, or can you bend the floating-point calculation rules (-ffast-math)?
It also depends on particular processor extensions. Do you enable MMX/SSE or not? Do you use intrinsics or not? Does the code need to remain i386-compatible or not?
Which version of gcc? Which version of msvc?
Do you use any of the gcc/msvc extensions?
Do you use microbenchmarking or macrobenchmarking?
And at the end you find out that the result was less than the statistical error ;)
Even if a single application is used, the result may be inconclusive (function A performs better with gcc but B with msvc).
PS. I would say Cygwin will be slowest, as it has an additional level of indirection between POSIX and WinAPI.

Resources