Vectorizing Fortran array operation - performance

I am looking into optimizing some Fortran code, and one kind of optimization that came to mind is using vector instructions. The compiler already vectorizes a lot of operations when compiling with -O3 (I am using gfortran), but it misses some vectorization opportunities that seem possible, for example this line of code:
p%glnrho(:,i)=p%rho1*p%grho(:,i)
Here p is a derived type, and the arrays are real arrays of dimension (nx,3), (nx), and (nx,3) respectively.
With -fopt-info-vec-missed I get the following explanation:
missed:
not vectorized: complicated access pattern.
There are multiple array operations of this kind, so I am wondering whether there is a simple hint I can give the compiler that would work in this case and that I could reuse in multiple places, or possibly some other approach that wouldn't require too much code modification.
Apologies if the question isn't explained well enough or is too specific to my situation. I am not very familiar with how the compiler translates Fortran array operations to vector instructions, so any answers would be appreciated :)
What I have tried:
I compiled with optimizations turned on (-O3) to see how well the compiler translates array operations to vector instructions. The compiler seems to miss some opportunities that don't seem that complex.

I would first try an ASSOCIATE construct, to hopefully simplify what the compiler sees:
ASSOCIATE( a => p%glnrho(:,i), b => p%rho1*p%grho(:,i) )
a(:) = b(:)
END ASSOCIATE
To answer a possible objection, associate names are not pointers, despite the similar notation.
EDIT: if it's still not optimizing, just try a = b as mentioned by @VladimirFГероямслава in the comments.

Related

efficiency loss due to use of function pointer in place of if-block

Suppose we have a Fortran function (for example a mathematical optimization algorithm) that takes as input, another Fortran function:
myOptimizer(func)
Now depending on the user's choice, the input function could be from a list of several different functions. This list of choices can be implemented via an if-block:
if (userChoice=='func1') then
myOptimizer(func1)
elseif (userChoice=='func2') then
myOptimizer(func2)
elseif (userChoice=='func3') then
myOptimizer(func3)
end if
Alternatively, I could also define function pointers, and write this as,
if (userChoice=='func1') then
func => func1
elseif (userChoice=='func2') then
func => func2
elseif (userChoice=='func3') then
func => func3
end if
myOptimizer(func)
Based on my tests with Intel Fortran Compiler 2017 with the O2 flag, the second implementation is 4-5 times slower than the if-block implementation. From the software development perspective, I would strongly prefer the second approach since it results in much more concise and cleaner code, at least in my problem, where there is a fixed workflow with different possible input functions. However, performance matters equally in this problem.
Is this loss of performance from indirect function calls expected in all Fortran codes, or is it a compiler-dependent issue? Is there a way to use indirect function calls without the performance loss? How about other languages such as C/C++?
This is a pure guess based on how compilers generally work and what might explain the 4-5x perf difference.
In the first version, maybe the compiler is inlining myOptimizer() into each call site with func1, func2, and func3 inlined into the optimizer, so when it runs there's no actual function pointer or function call happening.
An indirect function-call isn't much more expensive than a regular function call on modern x86 hardware. It's the lack of inlining that really hurts, especially for FP code. Spilling / reloading all the floating-point registers around a function call is expensive, especially if the function is fairly small.
i.e. what's probably hurting you is that your 2nd version convinces the compiler not to undo the indirection. This would be true in C / C++ as well.
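To make the hypothesis above concrete, here is a minimal C sketch of the two source shapes. All names (func1, func2, sum_indirect, sum_dispatch, sum_via_pointer) are hypothetical stand-ins, not taken from the question's code, and whether a given compiler actually inlines either version is something only the generated asm can confirm.

```c
#include <stddef.h>

/* Hypothetical objective functions standing in for func1/func2. */
static double func1(double x) { return x * x; }
static double func2(double x) { return 2.0 * x + 1.0; }

/* Generic driver (plays the role of myOptimizer): calls f through a pointer. */
double sum_indirect(double (*f)(double), const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += f(x[i]);
    return s;
}

/* Shape 1: one call site per choice. At each call site the callee is a
 * compile-time constant, so the compiler may clone or inline the driver
 * together with func1/func2 and remove the indirect call. */
double sum_dispatch(int choice, const double *x, size_t n)
{
    if (choice == 1) return sum_indirect(func1, x, n);
    if (choice == 2) return sum_indirect(func2, x, n);
    return 0.0; /* unknown choice */
}

/* Shape 2: pick a pointer first, then a single call site. The callee is
 * now a runtime value, so the loop may keep a real indirect call. */
double sum_via_pointer(int choice, const double *x, size_t n)
{
    double (*f)(double) = NULL;
    if (choice == 1) f = func1;
    else if (choice == 2) f = func2;
    if (f == NULL) return 0.0; /* unknown choice */
    return sum_indirect(f, x, n);
}
```

Both shapes compute the same result. Comparing the asm of sum_dispatch and sum_via_pointer at the optimization level you actually use shows whether the indirection survived.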
Hand-holding your compiler into making fast asm probably means you have to write it the first way, unless there's a profile-guided optimization option you can use that might make the compiler realize this is a hot spot and it's worth trying harder with the source written the 2nd way. (Sorry I don't use Fortran, and I only know a few of the options for Intel's C/C++ compiler from looking at its asm output vs. gcc and clang on http://gcc.godbolt.org/)
To see if my hypothesis is right, check the compiler-generated asm. If the first version doesn't actually pass a function pointer to a stand-alone definition of myOptimizer, but the 2nd one does, that's probably all there is to it.
See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output. Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a good intro to reading compiler output and why you might want to.

Does an ODE written using GNU gsl outperform Mathematica's NDSolve?

Would an ODE solver written in C, perhaps using the GSL library, have significant speed advantages compared with Mathematica 8.0 NDSolve? How would it fare in terms of accuracy?
My understanding is that compiled code could in principle be faster, but that these days NDSolve uses a lot of compiled code itself already somehow?
Also are there any options for using things like MathLink or Mathematica's compile function to speed solving an ODE up?
NDSolve and other numerical functions in Mathematica automatically compile your operand (e.g. the RHS of an ODE) to an intermediate "bytecode" language (the same one used by the Compile function). If you like you can specify CompilationTarget -> "C" and the function will be compiled all the way to C code and linked back in to Mathematica... You can see the generated C code yourself in this previous question on the Mathematica Stack Exchange:
https://mathematica.stackexchange.com/questions/821/how-well-does-mathematica-code-exported-to-c-compare-to-code-directly-written-fo/830#830
Of course, it's always possible in principle to hand-write a faster algorithm... But there are a lot of things to optimize that Mathematica will do automatically. You probably don't want to be responsible for manually optimizing the computation of a sparse matrix of partial derivatives in an optimization problem for example.
Mathematica's focus is on usability. They do use numerical libraries, so the speed would be the same as the best available library or worse (in almost all cases). For example, I heard they use Eigen for matrix stuff.
The other thing you should consider is that although they optimize the functions they provide, your own functions are not optimized. So the derivative that you calculate at each step would be faster in C.
To my friends who are deciding between Mathematica and C++, I say go with Mathematica, since they should focus on getting results fast rather than building the fastest code.
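For a sense of what "the derivative in C" looks like, here is a minimal hand-written classical Runge-Kutta (RK4) sketch in plain C. It does not use GSL, it has a fixed step size and no error control, and the names rhs_fn, decay and rk4_solve are made up for illustration. It says nothing about how fast NDSolve is; it only shows that the right-hand side becomes an ordinary compiled function.

```c
/* Right-hand side of dy/dt = f(t, y); hypothetical type name. */
typedef double (*rhs_fn)(double t, double y);

/* Example RHS: exponential decay, dy/dt = -y. */
static double decay(double t, double y)
{
    (void)t; /* this RHS does not depend on t */
    return -y;
}

/* Integrate from t0 to t1 with a fixed number of classical RK4 steps. */
double rk4_solve(rhs_fn f, double t0, double y0, double t1, int steps)
{
    double h = (t1 - t0) / steps;
    double t = t0, y = y0;
    for (int i = 0; i < steps; ++i) {
        double k1 = f(t, y);
        double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
        double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
        double k4 = f(t + h, y + h * k3);
        y += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
        t += h;
    }
    return y;
}
```

Calling rk4_solve(decay, 0.0, 1.0, 1.0, 100) approximates exp(-1). A production solver would add adaptive step size control, which is the part libraries such as GSL or NDSolve handle for you.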

Performance Implications of Point-Free style

I’m taking my first baby-steps in learning functional programming using F# and I’ve just come across the Forward Pipe (|>) and Forward Composition (>>) operators. At first I thought they were just sugar rather than having an effect on the final running code (though I know piping helps with type inference).
However I came across this SO article:
What are advantages and disadvantages of “point free” style in functional programming?
Which has two interesting and informative answers (that instead of simplifying things for me opened a whole can of worms surrounding “point-free” or “pointless” style). My take-home from these (and other reading around) is that point-free is a debated area. Like lambdas, point-free style can make code easier to understand, or much harder, depending on use. It can help in naming things meaningfully.
But my question concerns a comment on the first answer:
AshleyF muses in the answer:
“It seems to me that composition may reduce GC pressure by making it more obvious to the compiler that there is no need to produce intermediate values as in pipelining; helping make the so-called "deforestation" problem more tractable.”
Gasche replies:
“The part about improved compilation is not true at all. In most languages, point-free style will actually decrease performances. Haskell relies heavily on optimizations precisely because it's the only way to make the cost of these things bearable. At best, those combinators are inlined away and you get an equivalent pointful version”
Can anyone expand on the performance implications? (In general and specifically for F#.) I had just assumed it was a writing-style thing and the compiler would untangle both idioms into equivalent code.
This answer is going to be F#-specific. I don't know how the internals of other functional languages work, and the fact that they don't compile to CIL could make a big difference.
I can see three questions here:
What are the performance implications of using |>?
What are the performance implications of using >>?
What is the performance difference between declaring a function with its arguments and without them?
The answers (using examples from the question you linked to):
Is there any difference between x |> sqr |> sum and sum (sqr x)?
No, there isn't. The compiled CIL is exactly the same (here represented in C#):
sum.Invoke(sqr.Invoke(x))
(Invoke() is used, because sqr and sum are not CIL methods, they are FSharpFunc, but that's not relevant here.)
Is there any difference between (sqr >> sum) x and sum (sqr x)?
No, both samples compile to the same CIL as above.
Is there any difference between let sumsqr = sqr >> sum and let sumsqr x = (sqr >> sum) x?
Yes, the compiled code is different. If you specify the argument, sumsqr is compiled into a normal CLI method. But if you don't specify it, it's compiled as a property of type FSharpFunc with a backing field, whose Invoke() method contains the code.
The effect of all this is that invoking the point-free version means loading one field (the FSharpFunc), which is not done if you specify the argument. But I think that shouldn't measurably affect performance, except in the most extreme circumstances.

How do I force gcc to inline a function?

Does __attribute__((always_inline)) force a function to be inlined by gcc?
Yes.
From the documentation (v4.1.2 and latest):
always_inline
Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
It should. I'm a big fan of manual inlining. Sure, used in excess it's a bad thing. But often times when optimizing code, there will be one or two functions that simply have to be inlined or performance goes down the toilet. And frankly, in my experience C compilers typically do not inline those functions when using the inline keyword.
I'm perfectly willing to let the compiler inline most of my code for me. It's only those half dozen or so absolutely vital cases that I really care about. People say "compilers do a good job at this." I'd like to see proof of that, please. So far, I've never seen a C compiler inline a vital piece of code I told it to without using some sort of forced inline syntax (__forceinline on MSVC, __attribute__((always_inline)) on gcc).
Yes, it will. That doesn't necessarily mean it's a good idea.
According to the gcc optimize options documentation, you can tune inlining with parameters:
-finline-limit=n
By default, GCC limits the size of functions that can be inlined. This flag
allows coarse control of this limit. n is the size of functions that can be
inlined in number of pseudo instructions.
Inlining is actually controlled by a number of parameters, which may be specified
individually by using --param name=value. The -finline-limit=n option sets some
of these parameters as follows:
max-inline-insns-single is set to n/2.
max-inline-insns-auto is set to n/2.
I suggest reading about all the inlining parameters in more detail, and setting them appropriately.
I want to add here that I have a SIMD math library where inlining is absolutely critical for performance. Initially I set all functions to inline but the disassembly showed that even for the most trivial operators it would decide to actually call the function. Both MSVC and Clang showed this, with all optimization flags on.
I did as suggested in other posts in SO and added __forceinline for MSVC and __attribute__((always_inline)) for all other compilers. There was a consistent 25-35% improvement in performance in various tight loops with operations ranging from basic multiplies to sines.
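A minimal sketch of that approach, assuming a hypothetical FORCE_INLINE macro and a toy vec4 type (neither is taken from the linked library). __forceinline and __attribute__((always_inline)) are the real MSVC and GCC/Clang spellings; everything else here is made up for illustration.

```c
/* Hypothetical portability macro: pick the compiler's forced-inline spelling. */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline /* fallback: plain hint only */
#endif

/* Toy 4-component vector, standing in for a SIMD math type. */
typedef struct { float x, y, z, w; } vec4;

/* Small operators like these are the ones that must not become calls. */
FORCE_INLINE vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}

FORCE_INLINE float vec4_dot(vec4 a, vec4 b)
{
    vec4 p = vec4_mul(a, b);
    return p.x + p.y + p.z + p.w;
}

/* Tight loop that relies on the operators above being inlined. */
float sum_of_dots(const vec4 *a, const vec4 *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += vec4_dot(a[i], b[i]);
    return s;
}
```

Whether this helps is workload-dependent, so it is worth checking the disassembly of the hot loop with and without the macro before applying it widely.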
I didn't figure out why they had such a hard time inlining (perhaps templated code is harder?) but the bottom line is: there are very valid use cases for inlining manually and huge speedups to be gained.
If you're curious this is where I implemented it. https://github.com/redorav/hlslpp
Yes. It will inline the function regardless of any other options set. See here.
One can also use __always_inline. I have been using that for C++ member functions with GCC 4.8.1, but I could not find a good explanation of it in the GCC documentation.
Actually the answer is "no". All it means is that the function is a candidate for inlining even with optimizations disabled.

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, it will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but that didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
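As an illustration of the intrinsics route, here is a small sketch that adds two float arrays four elements at a time with SSE. The function name add_arrays and the HAVE_SSE macro are made up for this example. The intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps from xmmintrin.h) are the standard SSE ones, and the code falls back to a plain scalar loop when SSE is not available at compile time.

```c
#include <stddef.h>

/* Detect SSE at compile time (GCC/Clang define __SSE__; MSVC x64 always has it). */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE
    /* Main loop: 4 floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The unaligned load/store variants are used so the sketch works for any pointer. Aligned variants exist but require 16-byte aligned data.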
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
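A minimal sketch of the kind of loop GCC's auto-vectorizer is designed for. The function name scale_add is made up. The restrict qualifiers are an assumption that the two arrays never overlap, which removes one common reason for the vectorizer to give up.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]; x and y are assumed not to overlap (restrict). */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiling with something like gcc -O3 -fopt-info-vec (or -O2 -ftree-vectorize -fopt-info-vec) should make GCC report which loops it vectorized. Use -fopt-info-vec-missed to see the ones it skipped and why.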
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits (for 64-bit reals it is actually slower to do SIMD).
The latest versions of the compiler will also automatically parallelise across cores.

Resources