gcc optimization affects bounds checking

Consider the following example:
int a[4];

int main() {
    a[4] = 12; // <-- out-of-bounds write: valid indices are 0..3
    return 0;
}
This is clearly an out of bounds error, is it not? I was wondering when gcc would warn about this, and found that it will only do so if optimisation is -O2 or higher (this is affected by the -ftree-vrp option that is only set automatically for -O2 or higher).
I don't really see why this makes sense, and I wonder whether it is correct that gcc does not warn otherwise.
The documentation has this to say about the matter:
This allows the optimizers to remove unnecessary range checks like array bound checks and null pointer checks.
Still, why should that check be unnecessary?

Your example is a case of constant propagation, not value range propagation, and it certainly triggers a warning on my version of gcc (4.5.1) whether or not -ftree-vrp is enabled.
In general, Java and Fortran are the only languages supported by gcc that will generate code to check array bounds (Java by default, Fortran only if you explicitly ask for it with -fbounds-check).
However, although C/C++ does not support any such thing, the compiler will still warn you at compile time if it believes that something is amiss. For constants, this is pretty obvious; for variable ranges, it is somewhat harder.
The clause "allows the compiler to remove unnecessary range checks" relates to cases where for example you use an unsigned 8 bit wide variable to index into an array that has >256 entries or an unsigned 16 bit value to index into an array of >65536 elements. Or, if you iterate over an array in a loop, and the (variable) loop counter is bounded by values that can be proven as compile-time constants which are legal array indices, so the counter can never possibly go beyond the array bounds.
In such cases, the compiler will neither warn you nor generate any code for target languages where this is supported.
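As an illustration of the loop-counter case described above, here is a minimal hypothetical sketch (not taken from gcc's documentation):

int table[16];

void fill(void) {
    // Value range propagation can prove that i stays in [0, 15], so
    // table[i] can never be out of bounds: no warning is needed, and no
    // bounds check would be generated in a language that supports them.
    for (int i = 0; i < 16; ++i)
        table[i] = i;
}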

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
For example, we can take the function (1+x^2)^3 when x>3, and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain was much smaller than I expected (1.5x). I also replaced the conditional with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from avoiding branch mispredictions.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics), and I am not sure how these affect the computation. I wrote my code with AVX2, so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order execution. Another problem is that I am unsure whether the performance of the loop I am trying to write is bound by memory bandwidth.
Question
How can I determine whether memory bandwidth or pipeline hazards are limiting the performance of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in Eigen, MSVC or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp * tmp;
        tmp = tmp > 3 ? sqrp1 * sqrp1 * sqrp1 : 0;
    }
}
GCC is avoiding the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside the ternary would let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G. GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised, or an invalid exception if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (Related: How to force GCC to assume that a floating-point expression is non-negative?)
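For reference, here is a sketch of that hoisted rewrite, reconstructed from the description above (the godbolt link has the exact code; this assumes the same includes as the first snippet):

void foo_hoisted(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double tmp = arr[i];
        double sqrp1 = 1.0 + tmp * tmp;
        double cubed = sqrp1 * sqrp1 * sqrp1; // unconditional FP math in the source
        arr[i] = tmp > 3 ? cubed : 0;         // only the select is conditional
    }
}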
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements, if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd.)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.
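A sketch of that manual expansion (the helper name cube is mine, not from the answer):

// Two explicit multiplies instead of std::pow(s, 3.0), which a compiler
// may otherwise lower to a full pow() library call.
static inline double cube(double s) {
    double s2 = s * s;
    return s2 * s;
}

The functor body from the question then becomes x > 3 ? cube(1.0 + x*x) : 0.0.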

OpenACC Scheduling

Say that I have a construct like this:
for (int i = 0; i < 5000; i++) {
    const int upper_bound = f(i);
    #pragma acc parallel loop
    for (int j = 0; j < upper_bound; j++) {
        // Do work...
    }
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.
The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.
According to the OpenACC 2.6 specification [1], lines 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
That seems to be the case here, so your code is valid.
However, note that it is implementation-defined how the work is distributed among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define values of gang/workers using num_gangs and num_workers, and the integer expression passed to those clauses can depend on the value of your function (See 2.5.7 and 2.5.8 on OpenACC specification).
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf
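For example, something along these lines might work (a hypothetical sketch; the vector length of 128 is my assumption, not a value from the answer):

for (int i = 0; i < 5000; i++) {
    const int upper_bound = f(i);
    // Scale the gang count with the shrinking trip count.
    #pragma acc parallel loop num_gangs((upper_bound + 127) / 128) vector_length(128)
    for (int j = 0; j < upper_bound; j++) {
        // Do work...
    }
}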

Compiler assumptions about relative locations from memory objects

I wonder what assumptions compilers make about the relative locations of memory objects.
For example, if we allocate two stack variables of one byte each, one right after the other, and initialize both with zero, can a compiler optimize this by emitting a single instruction that overwrites both bytes in memory with zeros, because it knows the relative positions of the two variables?
I am specifically interested in the better-known compilers like gcc, g++, clang, the Windows C/C++ compiler, etc.
A compiler can optimize multiple assignments into one.
a = 0;
b = 0;
might become something like
*(short*)&a = 0;
The subtle part is "if we allocate two stack variables of size 1 byte each, right after another", since you cannot really do that: a compiler can shuffle stack positions around at will, and simply declaring variables does not necessarily mean any stack allocation at all; the variables might live only in registers. In C you would have to use alloca, and even that does not provide "right after another".
Even more generally, the C standard does not allow you to compare the relative positions of distinct objects; such pointer comparisons are undefined behavior.
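If you do need guaranteed adjacency, the well-defined route is to make it explicit with an aggregate. A minimal sketch (the struct is hypothetical):

#include <cstring>

struct Pair { unsigned char a, b; }; // adjacent by definition: char members
                                     // need no padding between them

void reset(Pair &p) {
    std::memset(&p, 0, sizeof p);    // one call the compiler can lower to a
                                     // single two-byte store
}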

What is the purpose of gcc's -Wbad-function-cast?

As advised by an answer here, I turned on -Wbad-function-cast to see if my code had any bad behavior gcc could catch, and it turned up this example:
unsigned long n;
// ...
int crossover = (int)pow(n, .14);
(it's not critical here that crossover is an int; it could be unsigned long and the message would be the same).
This seems like a pretty ordinary and useful example of a cast. Why is this problematic? Otherwise, is there a reason to keep this warning turned on?
I generally like to set a lot of warnings, but I can't wrap my mind around the use case for this one. The code I'm working on is heavily numerical and there are lots of times that things are cast from one type to another as required to meet the varying needs of the algorithms involved.
You'd better take this warning seriously.
If you want to get an integer from the floating-point result of pow, that is a rounding operation, which should be done with one of the standard rounding functions like round. Doing it with an integer cast may yield surprises: you lose the fractional part, so 2.76 ends up as 2 under integer truncation, just as 2.12 does. Even if you want this behavior, you'd better specify it explicitly with floor. This will improve the readability and maintainability of your code.
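A small illustration of the difference (my example, not from the original answer):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 2.76;
    printf("%d\n", (int)x);      // 2: the cast truncates toward zero
    printf("%ld\n", lround(x));  // 3: rounds to nearest
    printf("%.0f\n", floor(x));  // 2: truncates too, but the intent is explicit
    return 0;
}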
The utility of the -Wbad-function-cast warning is limited.
Likely, it is no coincidence that neither -Wall nor -Wextra enables this warning, and that it is not even available for C++ (it is C/Objective-C only).
Your concrete example exploits neither undefined nor implementation-defined behavior (cf. ISO C11, section 6.3.1.4). Thus, the warning gives you zero benefit here.
In contrast, if you try to rewrite your code to make -Wbad-function-cast happy you just add superfluous function calls that even recent GCC/Clang compilers don't optimize away with -O3:
#include <math.h>
#include <fenv.h>

int f(unsigned n)
{
    int crossover = lrint(floor(pow(n, .14)));
    return crossover;
}
(a negative example: no warning is emitted with -Wbad-function-cast, but the superfluous function calls remain)

Can I get the behavior of -Wfloat-equal for all comparisons except literal zero?

I'd like to enable -Wfloat-equal in my build options (a GCC flag that issues a warning when two floating-point numbers are compared via the == or != operators). However, in several header files of libraries I use, and in a good portion of my own code, I often want to branch on non-zero values of a float or double, using if (x) or if (x != 0) or variations of that.
Since in these cases I am absolutely sure the value is exactly zero - the values checked are the result of an explicit zero-initialization, calloc, etc. - I cannot see a downside to using this comparison, rather than the considerably more expensive and less readable call to my near(x, 0) function.
Is there some way to get the effect of -Wfloat-equal for all other kinds of floating point equality comparisons, but allow these to pass unflagged? There are enough instances of them in library header files that they can significantly pollute my warning output.
From the question you ask, it seems like the warning is entirely appropriate. If you're comparing against exact zero to test whether data still has its initial zero value from calloc (which is actually incorrect from the standpoint of pure C, but works on any IEEE 754 conformant implementation), you could get false positives from non-zero values having been rounded to zero. In other words, it sounds like your code is incorrect.
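A concrete instance of that rounding-to-zero hazard (my example):

#include <stdio.h>

int main(void) {
    double tiny = 1e-200;
    double product = tiny * tiny; // true value 1e-400 underflows to exactly 0.0
    // The test is true even though product is the result of a computation,
    // not a leftover zero-initialization.
    printf("%d\n", product == 0.0);
    return 0;
}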
It's pretty horrible, but this avoids the warning:
#include <functional>

template <class T>
inline bool is_zero(T v)
{
    return std::equal_to<T>()(v, 0);
}
GCC doesn't report warnings that originate in system headers, and this wrapper makes the equality test happen inside one.
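Call sites would then look like this (hypothetical usage):

double x = 0.0;

if (!is_zero(x)) { /* ... */ } // no warning: the == happens inside <functional>
if (x != 0.0)    { /* ... */ } // this comparison -Wfloat-equal would flag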
