I am getting different results from the same code when I compile it with the Intel Fortran compiler and the Portland Group compiler. What would be the closest flag in PGI Fortran to -fp-model precise in ifort?
Is there a set of flags in PGI that would match the ifort combination -O2 -fp-model precise? Thanks
Programs compiled with different compilers (or even different versions of the same compiler) are not expected to produce exactly the same results. Different levels of optimization (-On flag) are also not equivalent between compilers (except -O0 which requests no optimization at all).
I do not think there is an exact equivalent in PGI to ifort's -fp-model precise, but you may want to look into the target-specific flags in the PGI Fortran compiler manual, and more specifically, these:
-K[no]ieee          Use IEEE division, optionally enable traps
-Ktrap=align|denorm|divz|fp|inexact|inv|none|ovf|unf
                    Determine IEEE trap conditions
-M[no]daz           Treat denormalized numbers as zero
-M[no]flushz        Set SSE to flush-to-zero mode
-M[no]fpapprox[=div|sqrt|rsqrt]
                    Perform certain FP operations using low-precision approximation
    div             Approximate floating-point division
    sqrt            Approximate floating-point square root
    rsqrt           Approximate floating-point reciprocal square root
-Mfpapprox          Approximate div, sqrt, rsqrt
-M[no]fpmisalign    Allow use of vector arithmetic instructions for unaligned operands
-M[no]fprelaxed[=div|recip|sqrt|rsqrt|[no]order]
                    Perform certain FP operations using relaxed precision
    div             Perform divide with relaxed precision
    recip           Perform reciprocal with relaxed precision
    sqrt            Perform square root with relaxed precision
    rsqrt           Perform reciprocal square root with relaxed precision
    [no]order       Allow expression reordering, including factoring
-Mfprelaxed         Choose which operations to relax depending on the target processor
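Based only on the descriptions above (a hedged suggestion; I have not verified that this reproduces ifort's -fp-model precise behavior), a reasonable starting point would be to request strict IEEE division and disable the approximate and relaxed-precision modes:

pgfortran -O2 -Kieee -Mnofpapprox -Mnofprelaxed blah.f90 -o blah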
It is acceptable for the program's output to differ in the less significant digits between different compilers. If your results are very different, your algorithm might not be very robust and may need work.
Related
I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
For example, we can take the function (1+x^2)^3 when x>3, and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
#include <Eigen/Dense>
#include <cmath>

Eigen::ArrayXd v;
then simply apply a functor:
v = v.unaryExpr([](double x) { return x > 3.0 ? std::pow(1.0 + x * x, 3.0) : 0.0; });
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain was much smaller than I expected (1.5x). I also replaced the condition with logical AND instructions, basically executing both branches and zeroing out the result when x<=3 (roughly as in the sketch below). I presume that the gain came mostly from avoiding branch mispredictions.
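Roughly, the branchless version looks like this (a sketch, assuming AVX2 and FMA are available; the compare mask zeroes the lanes where x <= 3):

#include <immintrin.h>
#include <cstddef>

void poly_avx2(double *arr, std::size_t n) {
    const __m256d ones   = _mm256_set1_pd(1.0);
    const __m256d thresh = _mm256_set1_pd(3.0);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d x = _mm256_loadu_pd(arr + i);
        __m256d s = _mm256_fmadd_pd(x, x, ones);            // 1 + x*x
        __m256d c = _mm256_mul_pd(_mm256_mul_pd(s, s), s);  // (1 + x^2)^3
        __m256d m = _mm256_cmp_pd(x, thresh, _CMP_GT_OQ);   // all-ones where x > 3
        _mm256_storeu_pd(arr + i, _mm256_and_pd(c, m));     // zero where x <= 3
    }
    for (; i < n; i++) {  // scalar tail
        double x = arr[i];
        double s = 1.0 + x * x;
        arr[i] = x > 3.0 ? s * s * s : 0.0;
    }
}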
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics), and I am not sure how they affect the computation. I wrote my code with AVX2, so I was expecting a 4x gain; I presume the dependencies play a role, but I cannot be sure, as the CPU has out-of-order execution. Another problem is that I am unsure whether the performance of the loop I am trying to write is bound by memory bandwidth.
Question
How can I determine whether memory bandwidth or pipeline hazards are limiting the implementation of this loop? Where can I learn techniques to better vectorize it? Are there good tools for this with Eigen, MSVC, or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp * tmp;
        tmp = tmp > 3 ? sqrp1 * sqrp1 * sqrp1 : 0;
    }
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside the ternary would let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G. GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised, or an invalid exception if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (Related: How to force GCC to assume that a floating-point expression is non-negative?)
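Presumably the variant behind that link looks something like this (a sketch: the cube is hoisted out of the ternary, so only the final select depends on the condition):

#include <stdlib.h>

void foo_unconditional(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double x = arr[i];
        double sqrp1 = 1.0 + x * x;
        double cube = sqrp1 * sqrp1 * sqrp1;  // unconditional FP math
        arr[i] = x > 3 ? cube : 0.0;          // only the select is conditional
    }
}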
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements; within-statement-only contraction is what you need if you want to avoid breaking, say, Kahan summation. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd.)
Of course you'll want to avoid std::pow with a small integer exponent; compilers might not optimize that into just 2 multiplies and might instead call the full pow function.
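For instance (a trivial sketch), spelling the exponent out as multiplies guarantees the cheap form:

// std::pow(s, 3.0) may lower to a full pow() library call;
// writing the multiplies out guarantees just two multiply instructions.
inline double cube(double s) { return s * s * s; }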
Suppose you are trying to multiply a floating-point number k by a small integer constant n (by small I mean -20 <= n <= 20). The naive way of doing this is converting n to a floating-point number (which, for the purposes of this question, does not count towards the runtime) and executing a floating-point multiply. However, for n = 2, it seems likely that k + k is a faster way of computing it. At what n does the multiply instruction become faster than repeated additions (plus a negation at the end if n < 0)?
Note that I am not particularly concerned about accuracy here; I am willing to allow unsound optimizations as long as they get roughly the right answer (i.e.: up to 1024 ULP error is probably fine).
I am writing OpenCL code, so I'm interested in the answer to this question in many computational contexts (x86-64, x86-64 + AVX256, GPUs).
I could benchmark this, but since I don't have a particular architecture in mind, I'd prefer a theoretical justification of the choice.
According to AMD's OpenCL optimisation guide for GPUs, section 3.8.1 "Instruction Bandwidths", for single-precision floating point operands, addition, multiplication and 'MAD' (multiply-add) all have a throughput of 5 per cycle on GCN based GPUs. The same is true for 24-bit integers. Only once you move to 32-bit integers are multiplications much more expensive (1/cycle). Int-to-float conversions and vice versa are also comparatively slow (1/cycle), and unless you have a double-precision float capable model (mostly FirePro/Radeon Pro series or Quadro/Tesla from nvidia) operations on doubles are super slow (<1/cycle). Negation is typically "free" on GPUs - for example GCN has sign flags on instruction operands, so -(a + b) compiles to one instruction after transforming to (-a) + (-b).
Nvidia GPUs tend to be a bit slower at integer operations, for floats it's a similar story to AMD's though: multiplications are just as fast as addition, and if you can combine them into MAD operations, you can double throughput. Intel's GPUs are quite different in other regards, but again they're very fast at FP multiplication and addition.
Basically, it's really hard to beat a GPU at floating-point multiplication, as that's essentially the one thing they're optimised for.
On the CPU it's typically more complicated - Agner Fog's optimisation resources and instruction tables are the place to go for the details. Note though that on many CPUs you'll pay a penalty for interpreting float data as integer and back because ALU and FPU are typically separate. (For example if you wanted to optimise multiplying floats by a power of 2 by performing an integer addition on their exponents. On x86, you can easily do this by operating on SSE or AVX registers using first float instructions, then integer ones, but it's generally not good for performance.)
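As a concrete illustration of that exponent trick (a scalar C++20 sketch; only valid for normal, finite inputs whose result also stays in the normal range, so no zeros, denormals, infinities, or NaNs):

#include <bit>
#include <cstdint>

// Multiply x by 2^k by adding k to the biased exponent field.
float mul_pow2(float x, int k) {
    std::uint32_t bits = std::bit_cast<std::uint32_t>(x);
    bits += static_cast<std::uint32_t>(k) << 23;  // exponent lives at bits 23..30
    return std::bit_cast<float>(bits);
}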
For specifics I am talking about x87 PC architecture and the C compiler.
I am writing my own interpreter and the reasoning behind the double datatype confuses me. Especially where efficiency is concerned. Could someone explain WHY C has decided on a 64-bit double and not the hardware native 80-bit double? And why has the hardware settled on an 80-bit double, since that is not aligned? What are the performance implications of each? I would like to use an 80-bit double for my default numeric type. But the choices of the compiler developers make me concerned that this is not the best choice.
double on x86 is only 2 bytes shorter, why doesn't the compiler use the 10 byte long double by default?
Can I get an example of the extra precision gotten by 80-bit long double vs double?
Why does Microsoft disable long double by default?
In terms of magnitude, how much worse / slower is long double on typical x86/x64 PC hardware?
The answer, according to Mysticial, is that Microsoft uses SSE2 for its double data type. The x87 floating-point unit (FPU) is seen as outdated and slow in comparison to modern CPU extensions, and SSE2 does not support the 80-bit format, hence the compiler's choice of 64-bit precision.
On 32-bit x86, since not all CPUs have SSE2, Microsoft still uses the x87 FPU unless the compiler switch /arch:SSE2 is given, which makes the code incompatible with those older CPUs.
Wrong question.
It has nothing to do with C; as far as I know, practically all languages use IEEE-754 single precision (32-bit) and double precision (64-bit) as their standard floating-point formats. C, as a language supporting very different hardware, only requires
sizeof(float) <= sizeof(double) <= sizeof(long double)
plus minimum precision and range guarantees for each type, so a specific C compiler still has considerable latitude in which hardware formats it maps these types to.
Intel, on Kahan's advice, decided to support as much precision as possible: calculations on the less precise formats (32- and 64-bit) should be performed internally with 80-bit precision.
The difference in precision and exponent range: a 64-bit double has approximately 16 decimal digits and a maximum decimal exponent of 308; the 80-bit format has 19 digits and a maximum decimal exponent of 4932.
Being much more precise and having a far greater exponent range, you can calculate intermediate results without overflow or underflow, and your final result has less rounding error.
So the question is really why long double no longer (usually) maps to the 80-bit format. In fact, many compilers did support it, but a lack of use and the race for benchmark performance effectively killed it.
This is actually several questions in one, some of which are too broad.
Could someone explain WHY C has decided on a 64-bit double and not the hardware native 80-bit double?
It's irrelevant to C, because the C standard only mandates minimum requirements for the built-in types, and it's entirely up to the compiler implementation to choose whatever format it wants to use for a type. Nothing prevents a C compiler from using some custom-made 77-bit floating-point type.
And why has the hardware settled on an 80-bit double, since that is not aligned? What are the performance implications of each?
It is aligned, to a multiple of 2 bytes. Remember that x87 dates back to the 8086 + 8087.
It's a good trade-off for hardware implementers and software writers who need more precision for exact rounding of double operations. Make the type too big and you'll need significantly more transistors: double the number of bits in the significand and the multiplier needs to be roughly four times as big.
William Kahan, a primary designer of the x87 arithmetic and initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An Extended format as wide as we dared (80 bits) was included to serve the same support role as the 13-decimal internal format serves in Hewlett-Packard’s 10-decimal calculators." Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087, and that the x87 extended precision was designed to be extensible to higher precision in future processors: "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
https://en.wikipedia.org/wiki/Extended_precision#IEEE_754_extended_precision_formats
As you can see, with the 64-bit significand you can share the components (adder, multiplier...) with the integer ALU.
I would like to use an 80-bit double for my default numeric type. But the choices of the compiler developers make me concerned that this is not the best choice. double on x86 is only 2 bytes shorter, why doesn't the compiler use the 10 byte long double by default?
It's actually intended for use as a temporary (like tmp = (b*c + d)/e) to avoid intermediate overflow or underflow issues without special techniques like Kahan summation. It's not meant as your default floating-point type. In fact, many people use floating-point literals incorrectly with long double or float: they forget to add the correct suffix, lose precision as a result, and then ask why long double gives exactly the same results as double. In summary, double should be used for almost every case, unless you're limited by bandwidth or precision and you really know what you're doing.
Can I get an example of the extra precision gotten by 80-bit long double vs double?
You can print the full value and see it for yourself (a small demonstration follows the links below). There are also a lot of related questions that are worth reading:
What are the applications/benefits of an 80-bit extended precision data type?
Difference between long double and double in C and C++
I want to know the difference between a long double and a double
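A minimal sketch of such a demonstration, assuming a compiler like GCC or Clang on x86 where long double is the 80-bit x87 format (the digit counts in the comments are approximate):

#include <cstdio>

int main() {
    // 1/3 is inexact in binary; the extra significand bits of the
    // 80-bit format show up as extra correct decimal digits.
    double d       = 1.0 / 3.0;
    long double ld = 1.0L / 3.0L;
    std::printf("double:      %.21f\n", d);   // correct to ~16 digits
    std::printf("long double: %.21Lf\n", ld); // correct to ~19 digits
    return 0;
}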
Why does Microsoft disable long double by default?
Microsoft doesn't disable long double by default. They just choose to map long double to IEEE-754 double precision, which is incidentally the same format as double. The type long double can still be used normally. They did that because math on SSE is faster and more consistent. That way you'll avoid "bugs" like the ones below:
Why casting double to int might give different results?
Apparently identical math expressions with different output
Why would the same code yield different numeric results on 32 vs 64-bit machines?
std::pow produce different result in 32 bit and 64 bit application
...
Besides, a 64-bit long double doesn't have the odd size that requires the compiler to pad it with 6 more zero bytes (or deal with a non-power-of-2 type width), which is a waste of resources.
That said, it's not as though the 80-bit long double were unavailable on x86. Currently only MSVC has abandoned the extended-precision type; other compilers for x86 (like GCC, Clang, ICC...) still support it and make the 80-bit IEEE-754 extended format the default for long double. For example, GCC has -mlong-double-64/80/128 and -m96/128bit-long-double to control the exact format of long double.
Or without potentially breaking ABI compatibility by changing long double, you can use GNU C floating point type names like __float80 on targets that support it. This example on Godbolt compiles to 80-bit FP math whether it targets Windows or Linux.
In terms of magnitude, how much worse / slower is long double on typical x86/x64 PC hardware?
This cannot be answered precisely, because latency and throughput depend on each specific microarchitecture. However, if you do a lot of floating-point operations, double will be significantly faster, because it has fewer bits in the significand and can be parallelized with SIMD. For example, you can work on a vector of 8 doubles at a time with AVX-512; that can't be done with the extended-precision type.
Also, 80-bit x87 fp load and store instructions are significantly slower than the "normal" versions that convert to/from 32 or 64-bit, and only fstp is available, not fst. See Peter Cordes's answer on retrocomputing about x87 performance on modern CPUs. (In fact that's a cross-site duplicate of this, asking why MSVC doesn't expose an 80-bit x87 type as long double.)
I'm still working on routines for arbitrary-length integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.
Everything works fine, but I wondered if I could speed it up a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use, and here is why:
SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)
The SSE idea is SIMD (same instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)
The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.
My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?
In the past, the answer to this question was a solid "no". But as of 2017, the situation is changing.
But before I continue, time for some background terminology:
Full-Word Arithmetic
Partial-Word Arithmetic
Full-Word Arithmetic:
This is the standard representation where the number is stored in base 2^32 or 2^64 using an array of 32-bit or 64-bit integers.
Many bignum libraries and applications (including GMP) use this representation.
In the full-word representation, every integer has a unique representation. Operations like comparisons are easy, but things like addition are more difficult because of the need for carry-propagation.
It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
Partial-Word Arithmetic
This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.
The authors of GMP call this "nails", where the "nail" is the unused portion of the word.
In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
Problems with Full-Word Arithmetic:
Vectorizing full-word arithmetic has historically been a lost cause:
SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*
*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.
Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:
Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.
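For illustration, here is a minimal sketch of a full-word bignum add using the carry intrinsic (GCC, Clang, and MSVC expose _addcarry_u64 via immintrin.h and lower it to adc on x86-64). The loop-carried carry is exactly the serial dependency that SIMD cannot express:

#include <immintrin.h>
#include <cstddef>

// dst = a + b over n 64-bit limbs, least-significant limb first.
// Returns the final carry-out.
unsigned char bignum_add(unsigned long long *dst,
                         const unsigned long long *a,
                         const unsigned long long *b, std::size_t n) {
    unsigned char carry = 0;
    for (std::size_t i = 0; i < n; i++)
        carry = _addcarry_u64(carry, a[i], b[i], &dst[i]);
    return carry;
}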
In light of this, it is difficult for SIMD to beat scalar x64 code at either bignum-add or bignum-multiply. Definitely not with SSE or AVX.
With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.
AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.
But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
Vectorization of Partial-Word Arithmetic
With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.
So as long as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, the partial-word representation uses signed integers to allow words to go negative.
Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.
Carryout on horizontally-vectorized bignums is still cheap, as you merely shift the nails into the next lane. A full carryout that completely clears the nail bits and reaches the unique representation usually isn't necessary unless you need to compare two numbers that are almost the same.
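To make the delayed-carry idea concrete, here is a sketch assuming 60-bit limbs stored in signed 64-bit lanes (so 4 nail bits) and a limb count that is a multiple of 4:

#include <immintrin.h>
#include <cstddef>

// Lane-wise add: with nail bits to spare, several additions can be
// queued before any carry pass is needed.
void nail_add(long long *dst, const long long *a, const long long *b,
              std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst + i),
                            _mm256_add_epi64(va, vb));
    }
}

// Occasional scalar carry pass: push everything above bit 60 into the
// next limb (the arithmetic right shift also handles negative limbs).
void propagate_nails(long long *x, std::size_t n) {
    long long carry = 0;
    for (std::size_t i = 0; i < n; i++) {
        long long v = x[i] + carry;
        x[i]  = v & ((1LL << 60) - 1);  // keep the low 60 bits
        carry = v >> 60;                // carry into the next limb
    }
}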
Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.
AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
Large Multiplication using FFTs
For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).
FFTs are completely vectorizable since they work on independent doubles. This is possible because, fundamentally, the representation that FFTs use is a partial-word representation.
To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.
If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.
But nevertheless, bignum arithmetic is possible to vectorize.
Disclosure:
I'm the author of y-cruncher which does plenty of large number arithmetic.
I've been given a 2D matrix representing temperature points on the surface of a metal plate. The edges of the matrix (plate) are held constant at 20 degrees C and there is a constant heat source of 100 degrees C at one pre-defined point. All other grid points are initially set to 50 degrees C.
My goal is to take all interior grid points and compute their steady-state temperatures by iteratively averaging over the four surrounding grid points (i+1, i-1, j+1, j-1) until I reach convergence (a change of less than 0.02 degrees C between iterations).
As far as I know, the order in which I iterate over the grid points is irrelevant.
To me, this sounds like a fine time to invoke the Fortran FORALL construct and explore the joys of parallelization.
How can I ensure that the code is indeed being parallelized?
For example, I can compile this on my single-core PowerBook G4 and I would expect no improvement in speed due to parallelization. But if I compile on a Dual Core AMD Opteron, I would assume that the FORALL construct can be exploited.
Alternatively, is there a way to measure the effective parallelization of a program?
Update
In response to M.S.B's question, this is with gfortran version 4.4.0. Does gfortran support automatic multi-threading?
It's remarkable that the FORALL construct has been rendered obsolete by, I suppose, auto-vectorization.
Perhaps this is best for a separate question, but how does auto-vectorization work? Is the compiler able to detect that only pure functions or subroutines are being used in a loop?
FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left hand side (LHS). This has to be done no matter how complex the operations on the RHS, including cases where the RHS and the LHS overlap.
Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL with the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:
!$omp parallel workshare
FORALL (i=,j=,...)
<assignment>
END FORALL
!$omp end parallel
gfortran -fopenmp blah.f90 -o blah
Note that a compliant OpenMP implementation (including at least older versions of gfortran) is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate the RHS as though it is enclosed in an OpenMP "single" directive. Note also that the "workshare" likely will not eliminate the temporary allocated by the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.
If you use the Intel Fortran Compiler, you can use a command-line switch to turn on/increase the compiler's verbosity level for parallelization/vectorization reports. During compilation/linking you will then be shown something like:
FORALL loop at line X in file Y has been vectorized
I admit that it has been a few years since I last used it, so the compiler message might actually look very different, but that's the basic idea.
The best way is to measure the wall-clock time of the calculation. Try it with and without the parallel code: if the wall-clock time decreases, your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the wall-clock time. The intrinsic cpu_time will give you the CPU time, which might actually go up when code is run multi-threaded, due to overhead.
The lore is that FORALL is not as useful as was thought when it was introduced into the language; it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.
Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?
To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.