gcc loop unrolling oddity - gcc

In the course of writing a "not-equal scan" for Boolean arrays,
I ended up writing this loop:
// Heckman recursive doubling
#ifdef STRENGTHREDUCTION // Haswell/gcc does not like the multiply
for( s=1; s<BITSINWORD; s=s*2) {
#else // STRENGTHREDUCTION
for( s=1; s<BITSINWORD; s=s+s) {
#endif // STRENGTHREDUCTION
w = w XOR ( w >> s);
}
What I observed was that gcc WOULD unroll the s=s*2 loop,
but not the s=s+s loop. This is slightly non-intuitive, as
the loop-count analysis for addition should, IMO be simpler
than for multiply. I suspect that gcc DOES know the s=s+s
loop count, and is merely being coy.
Does anyone know if there is some good reason for this
behavior on gcc's part?
I am asking this out of curiosity...
[The unrolled version, BTW, ran a fair bit slower than the loop.]
Thanks,
Robert

This is interesting.
First guess
My first guess would be that gcc's loop unroll analysis expects the addition case to benefit less from loop unrolling because s grows more slowly.
I experiment on the following code:
#include <stdio.h>
int main(int argc, char **args) {
int s;
int w = 255;
for (s = 1; s < 32; s = s * 2)
{
w = w ^ (w >> s);
}
printf("%d", w); // To prevent everything from being optimized away
return 0;
}
And another version that is the same except the loop has s = s + s. I find that gcc 4.9.2 unrolls the loop in the multiplicative version but not the additive one. This is compiling with
gcc -S -O3 test.c
So my first guess is that gcc assumes the additive version, if unrolled, would result in more bytes of code that fit in the icache and therefore does not optimize. However, changing the loop condition from s < 32 to s < 4 in the additive version still doesn't result in an optimization, even though it seems gcc should easily recognize that there are very few iterations of the loop.
My next attempt (going back to s < 32 as the condition) is to explicitly tell gcc to unroll loops up to 100 times:
gcc -S -O3 -fverbose-asm --param max-unroll-times=100 test.c
This still produces a loop in the assembly. Trying to allow more instructions in unrolled loops with --param max-unrolled-insns retains the loop as well. Therefore, we can pretty much eliminate the possibility that gcc thinks it's inefficient to unroll.
Interestingly, trying to compile with clang at -O3 immediately unrolls the loop. clang is known to unroll more aggressively, but this doesn't seem like a satisfying answer.
I can get gcc to unroll the additive loop by making it add a constant and not s itself, that is, I do s = s + 2. Then the loop unrolls.
Second guess
That leads to me theorize that gcc is unable to understand how many iterations the loop will run for (necessary for unrolling) if the loop's increase value depends on the counter's value more than once. I change the loop as follows:
for (s = 2; s < 32; s = s*s)
And it does not unroll with gcc, while clang unrolls it. So my best guess, in the end, is that gcc fails to calculate the number of iterations when the loop's increment statement is of the form s = s (op) s.

Compilers routinely perform strength reduction, so I would expect that
gcc would use it here, replacing s*2 by s+s, at which point the forms of both
source code expressions would match.
If that is not the case, then I think it is a bug in gcc. The analysis
to compute the loop count using s+s is (marginally) simpler than that
using s*2, so I would expect that gcc would be (marginally)
more likely to unroll the s+s case.

Related

What's the right way to compute integral base-2 logarithms at compile-time?

I have some positive constant value that comes from a different library than mine, call it the_val. Now, I want log_of_the_val to be floor(log_2(the_val)) - not speaking in C++ code - and I want that to happen at compile time, of course.
Now, with gcc, I could do something like
decltype(the_val) log_of_the_val = sizeof(the_val) * CHAR_BIT - __builtin_clz(the_val) - 1;
and that should work, I think (length - number of heading zeros). Otherwise, I could implement a constexpr function myself for it, but I'm betting that there's something else, and simpler, and portable, that I could use at compile-time. ... question is, what would that be?
The most straightforward solution is to use std::log2 from <cmath>, but that isn't specified to be constexpr - it is under gcc, but not under clang. (Actually, libstdc++ std::log2 calls __builtin_log2, which is constexpr under gcc.)
__builtin_clz is constexpr under both gcc and clang, so you may want to use that.
The fully portable solution is to write a recursive constexpr integral log2:
constexpr unsigned cilog2(unsigned val) { return val ? 1 + cilog2(val >> 1) : -1; }

Sum all numbers from one to a billion in Haskell

Currently i'm catching up on Haskell, and I'm super impressed so far. As a super simple test I wrote a program which computes the sum up till a billion. In order to avoid list creation, I wrote a function which should be tail recursive
summation start upto
| upto == 0 = start
| otherwise = summation (start+upto) (upto-1)
main = print $ summation 0 1000000000
running this with -O2 I get a runtime of about ~20sec on my machine, which kind of surprised me, since I thought the compiler would be more optimising. As a comparison I wrote a simple c++ program
#include <iostream>
int main(int argc, char *argv[]) {
long long result = 0;
int upto = 1000000000;
for (int i = 0; i < upto; i++) {
result += i;
}
std::cout << result << std::end;
return 0;
}
compiling with clang++ without optimisation the runtime is ~3secs. So I was wondering why my Haskell solution is so slow. Has anybody an idea?
On OSX:
clang++ --version:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
ghc --version:
The Glorious Glasgow Haskell Compilation System, version 7.10.3
Adding a type signature dropped my runtime from 14.35 seconds to 0.27. It is now faster than the C++ on my machine. Don't rely on type-defaulting when performance matters. Ints aren't preferable for, say, modeling a domain in a web application, but they're great if you want a tight loop.
module Main where
summation :: Int -> Int -> Int
summation start upto
| upto == 0 = start
| otherwise = summation (start+upto) (upto-1)
main = print $ summation 0 1000000000
[1 of 1] Compiling Main ( code/summation.hs, code/summation.o )
Linking bin/build ...
500000000500000000
14.35user 0.06system 0:14.41elapsed 100%CPU (0avgtext+0avgdata 3992maxresident)k
0inputs+0outputs (0major+300minor)pagefaults 0swaps
Linking bin/build ...
500000000500000000
0.27user 0.00system 0:00.28elapsed 98%CPU (0avgtext+0avgdata 3428maxresident)k
0inputs+0outputs (0major+171minor)pagefaults 0swaps
Skip the strike-out unless you want to see the unoptimized (non -O2) view.
Lets look at the evaluation:
summation start upto
| upto == 0 = start
| otherwise = summation (start+upto) (upto-1)
main = print $ summation 0 1000000000
-->
summation 0 1000000000
-->
summations (0 + 1000000000) 999999999
-->
summation (0 + 1000000000 + 999999999) 999999998
-->
summation (0 + 1000000000 + 999999999 + 999999998) 999999997
EDIT: I didn't see that you had compiled with -O2 so the above isn't occuring. The accumulator, even without any strictness annotations, suffices most of the time with proper optimization levels.
Oh no! You are storing one billion numbers in a big thunk that you aren't evaluating! Tisk! There are lots of solutions using accumulators and strictness - it seems like most stackoverflow answers with anything near this question will suffice to teach you those in addition to library functions, like fold{l,r}, that help you avoid writing your own primitive recursive functions. Since you can look around and/or ask about those concepts I'll cut to the chase with this answer.
If you really want to do this the correct way then you'd use a list and learn that Haskell compilers can do "deforestation" which means the billion-element list is never actually allocated:
main = print (sum [0..1000000000])
Then:
% ghc -O2 x.hs
[1 of 1] Compiling Main ( x.hs, x.o )
Linking x ...
% time ./x
500000000500000000
./x 16.09s user 0.13s system 99% cpu 16.267 total
Cool, but why 16 seconds? Well by default those values are Integers (GMP integers for the GHC compiler) and that's slower than a machine Int. Lets use Int!
% cat x.hs
main = print (sum [0..1000000000] :: Int)
tommd#HalfAndHalf /tmp% ghc -O2 x.hs && time ./x
500000000500000000
./x 0.31s user 0.00s system 99% cpu 0.311 total

set RNG state with openMP and Rcpp

I have a clarification question.
It is my understanding, that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a set RNG statement.
Now how does this all work with openMP either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma ){
Rcpp::NumericVector out(n);
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores=1 ){
omp_set_num_threads(cores);
Rcpp::NumericVector out(n);
#pragma omp parallel for schedule(dynamic)
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
And then run
set.seed(123)
a1=rnormrcpp1(100,2,3,2)
set.seed(123)
a2=rnormrcpp1(100,2,3,2)
set.seed(123)
a3=rnormrcpp2(100,2,3,2)
set.seed(123)
a4=rnormrcpp2(100,2,3,2)
all.equal(a1,a2)
all.equal(a3,a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the openMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i=0; i < n; i++) {
#pragma omp ordered
{
rnum = R::rnorm(mu,sigma);
}
out(i) = lots of processing on rnum
}
Although loop ordering essentially serialises the computation, it still allows for lots of processing on rnum to execute in parallel and hence parallel speed-up would be observed. See this answer for a better explanation as to why so.
Yes, sourceCpp() etc and an instantiation of RNGScope so the RNGs are left in a proper state.
And yes one can do OpenMP. But inside of OpenMP segment you cannot control in which order the threads are executed -- so you longer the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.

Auto vectorization on double and ffast-math

Why is it mandatory to use -ffast-math with g++ to achieve the vectorization of loops using doubles? I don't like -ffast-math because I don't want to lose precision.
You don’t necessarily lose precision with -ffast-math. It only affects the handling of NaN, Inf etc. and the order in which operations are performed.
If you have a specific piece of code where you do not want GCC to reorder or simplify computations, you can mark variables as being used using an asm statement.
For instance, the following code performs a rounding operation on f. However, the two f += g and f -= g operations are likely to get optimised away by gcc:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
f -= g;
return f;
}
On x86_64, you can use this asm statement to instruct GCC not to perform that optimisation:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
__asm__("" : "+x" (f));
f -= g;
return f;
}
You will need to adapt this for each architecture, unfortunately. On PowerPC, use +f instead of +x.
Very likely because vectorization means that you may have different results, or may mean that you miss floating point signals/exceptions.
If you're compiling for 32-bit x86 then gcc and g++ default to using the x87 for floating point math, on 64-bit they default to SSE, however the x87 can and will produce different values for the same computation so it's unlikely g++ will consider vectorizing if it can't guarantee that you will get the same results unless you use -ffast-math or some of the flags it turns on.
Basically it comes down to the floating point environment for vectorized code may not be the same as the one for non vectorized code, sometimes in ways that are important, if the differences don't matter to you, something like
-fno-math-errno -fno-trapping-math -fno-signaling-nans -fno-rounding-math
but first look up those options and make sure that they won't affect your program's correctness. -ffinite-math-only may help also
Because -ffast-math enables operands reordering which allows many code to be vectorized.
For example to calculate this
sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + … a[99]
the compiler is required to do the additions sequentially without -ffast-math, because floating-point math is neither commutative nor associative.
Is floating point addition commutative and associative?
Is floating point addition commutative in C++?
Are floating point operations in C associative?
Is Floating point addition and multiplication associative?
That's the same reason why compilers can't optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) without -ffast-math
That means no vectorization available unless you have very efficient horizontal vector adds.
However if -ffast-math is enabled, the expression can be calculated like this (Look at A7. Auto-Vectorization)
sum0 = a[0] + a[4] + a[ 8] + … a[96]
sum1 = a[1] + a[5] + a[ 9] + … a[97]
sum2 = a[2] + a[6] + a[10] + … a[98]
sum3 = a[3] + a[7] + a[11] + … a[99]
sum’ = sum0 + sum1 + sum2 + sum3
Now the compiler can vectorize it easily by adding each column in parallel and then do a horizontal add at the end
Does sum’ == sum? Only if (a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + ([a[3]+a[7]+…) == a[0] + a[1] + a[2] + … This holds under associativity, which floats don’t adhere to, all of the time. Specifying /fp:fast lets the compiler transform your code to run faster – up to 4 times faster, for this simple calculation.
Do You Prefer Fast or Precise? - A7. Auto-Vectorization
It may be enabled by the -fassociative-math flag in gcc
Further readings
Semantics of Floating Point Math in GCC
What does gcc's ffast-math actually do?
To enable auto-vectorization with gcc, ffast-math is not actually necessary. See https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.
Using -fassociative-math should be sufficient.
This has been the case since 2007, see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#oldnews
-fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats (2007-09-04).

Loop versioning with GCC

I am working on auto vectorization with GCC. I am not in a position to use intrinsics or attributes due to customer requirement. (I cannot get user input to support vectorization)
If the alignment information of the array that can be vectorized is unknown, GCC invokes a pass for 'loop versioning'. Loop versioning will be performed when loop vectorization is done on trees. When a loop is identified to be vectorizable, and the constraint on data alignment or data dependence is hindering it, (because they cannot be determined at compile time), then two versions of the loop will be generated. These are the vectorized and non-vectorized versions of the loop along with runtime checks for alignment or dependence to control which version is executed.
My question is how we have to enforce the alignment? If I have found a loop that is vectorizable, I should not generate two versions of the loop because of missing alignment information.
For example. Consider the below code
short a[15]; short b[15]; short c[15];
int i;
void foo()
{
for (i=0; i<15; i++)
{
a[i] = b[i] ;
}
}
Tree dump (options: -fdump-tree-optimized -ftree-vectorize)
<SNIP>
vector short int * vect_pa.49;
vector short int * vect_pb.42;
vector short int * vect_pa.35;
vector short int * vect_pb.30;
bb 2>:
vect_pb.30 = (vector short int *) &b;
vect_pa.35 = (vector short int *) &a;
if (((signed char) vect_pa.35 | (signed char) vect_pb.30) & 3 == 0) ;; <== (A)
goto <bb 3>;
else
goto <bb 4>;
bb 3>:
</SNIP>
At 'bb 3' version of vectorized code is generated. At 'bb 4' code without vectorization is generated. These are done by checking the alignment (statement 'A'). Now without using intrinsics and other attributes, how should I get only the vectorized code (without this runtime alignment check.)
If the data in question is being allocated statically, then you can use the __align__ attribute that GCC supports to specify that it should be aligned to the necessary boundary. If you are dynamically allocating these arrays, you can over-allocate by the alignment value, and then bump the returned pointer up to the alignment you need.
You can also use the posix_memalign() function if you're on a system that supports it. Finally, note that malloc() will always allocate memory aligned to the size of the largest built-in type, generally 8 bytes for a double. If you don't need better than that, then malloc should suffice.
Edit: If you modify your allocation code to force that check to be true (i.e. overallocate, as suggested above), the compiler should oblige by not conditionalizing the loop code. If you needed alignment to an 8-byte boundary, as it seems, that would be something like a = (a + 7) & ~3;.
I get only one version of the loop, using your exact code with these options: gcc -march=core2 -c -O2 -fdump-tree-optimized -ftree-vectorize vec.c
My version of GCC is gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8).
GCC is doing something clever here. It forces the arrays a and b to be 16-byte aligned. It doesn't do that to c, presumably because c is never used in a vectorizable loop.

Resources