Why doesn't my OpenMP simd directive have any effect?

I tried this code to test the SIMD directive in OpenMP.
#include <iostream>
#include <sys/time.h>
#include <cmath>
#define N 4096
#define M 1000
using namespace std;

int main()
{
    timeval start, end;
    float a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = i;
    gettimeofday(&start, NULL);
    for (int j = 0; j < M; j++)
    {
        #pragma omp simd
        for (int i = 0; i < N; i++)
            a[i] = pow(b[i], 2.1);
    }
    gettimeofday(&end, NULL);
    int time_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    cout << "time_used=" << time_used << endl;
    return 1;
}
But whether I compile it with
g++ -fopenmp simd.cpp
or
g++ simd.cpp
the reported "time_used" is almost the same. It looks like the SIMD directive I used doesn't have any effect?
Thanks!
Additional questions:
I replaced
a[i]=pow(b[i],2.1);
by
a[i]=b[i]+2.1;
and when I compile it with
g++ -fopenmp simd.cpp
the output of "time_used" is about 12000.
When I compile it with
g++ simd.cpp
the output of "time_used" is also about 12000, almost the same as before.
My computer: Haswell i5, 8 GB RAM, Ubuntu Kylin 16.04, GCC 5.4.0

The compiler can't auto-vectorize function calls. It can only vectorize specific arithmetic operations that can be done using SIMD instructions.
Therefore, you need a vector math library that implements the pow function using SIMD instructions. Intel provides one. I'm not sure if pow is one of the functions that it offers with vector optimizations, but I imagine it is. You should also beware that Intel's math library may not be optimal on AMD processors.
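For what it's worth, here is what I would try first (a sketch under a couple of assumptions: that your glibc ships libmvec, which Ubuntu 16.04's glibc 2.23 should, and that your GCC actually picks it up; I haven't verified this on your exact setup). Enable optimization and ask GCC to report what it vectorized:
g++ -O3 -fopenmp -ffast-math -march=native -fopt-info-vec-optimized simd.cpp
With -ffast-math, glibc's math headers declare SIMD variants of pow, so GCC may replace the scalar call with a call into libmvec's vectorized pow. Using powf(b[i], 2.1f) instead would keep the math in single precision and fit twice as many elements per vector. Also note that your original command lines have no -O flag at all, so GCC does very little optimization either way, which makes the two timings hard to compare.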
You claim that you tried changing the pow function call to a simple addition, but didn't see any improvement in the results. I'm not quite sure how that is possible, because if you change the inner loop from:
a[i]=pow(b[i],2.1);
to, say:
a[i] += b[i];
or:
a[i] += (b[i] * 2);
then GCC, with optimizations enabled, notices that you never use the result and elides the entire thing. It was unable to perform this optimization with the pow function call, because it didn't know whether the function had any other side-effects. However, with code that is visible to the optimizer, it can…well, optimize it. In some cases, it might be able to vectorize it. In this case, it was able to remove it entirely.
If you tried code where the optimizer removed this loop entirely, and you still didn't see an improvement on your benchmark scores, then clearly this is not a bottleneck in your code and you needn't worry about trying to vectorize it.
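If you do want to benchmark the vectorization itself, the usual trick is to make every result observable so the optimizer can't delete the work. A minimal sketch (my own illustration, not code from the question); compile with something like g++ -O2 -fopenmp bench.cpp:
#include <iostream>
#include <sys/time.h>
#define N 4096
#define M 1000

int main()
{
    timeval start, end;
    float a[N], b[N];
    float checksum = 0.0f;
    for (int i = 0; i < N; i++)
        b[i] = i;
    gettimeofday(&start, NULL);
    for (int j = 0; j < M; j++)
    {
        float c = j * 0.001f;          // varies per outer iteration, so the work can't be hoisted
        #pragma omp simd
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c;
        #pragma omp simd reduction(+:checksum)
        for (int i = 0; i < N; i++)
            checksum += a[i];          // every result is consumed, so the stores can't be dropped
    }
    gettimeofday(&end, NULL);
    long time_used = 1000000L * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    std::cout << "time_used=" << time_used << " checksum=" << checksum << std::endl;
    return 0;
}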

Related

Why don't gcc/clang vectorize 128-bit SIMD intrinsics into 256-bit when possible?

Suppose I have this function:
void test32(int* a, int* b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}
Clang and gcc both produce 256-bit SIMD when compiled with -O3 -march=core-avx2 (godbolt).
Now suppose I have this function:
void test128(__m128i* a, __m128i* b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = _mm_add_epi32(a[i], b[i]);
    }
}
With the same CFLAGS, clang and gcc both refuse to vectorize this to 256-bit (godbolt).
The naive code (auto-vectorized) therefore processes twice as many elements per iteration compared to the manually vectorized SSE2 code. How does this make sense? Is there a way to instruct the compiler to vectorize 128-bit SIMD intrinsics into 256-bit when AVX2 is available?
Unfortunately no, I don't know of a compiler option to re-vectorize intrinsics (or GNU C native vectors) to a wider type. That's one reason not to manually vectorize in the first place for cases that easily auto-vectorize.
It's sometimes useful to be able to tell the compiler what vectorization strategy you want it to use, and that's what intrinsics are for.
If compilers rewrote them too aggressively, that would be bad in some cases. For example, in a cleanup loop following a loop with wider vectors, you might use 128-bit vectors to leave fewer scalar elements at the end. Or maybe you have 16-byte alignment but not 32, and you care about Sandybridge specifically (where misaligned 32-byte load/store are quite bad). Or you're on a Haswell server where 256-bit AVX can reduce max turbo (at least for FP math instructions), so you only want to use 256-bit vectors in some functions that will be running in some phases of your program.
Basically it's a tradeoff between how close to writing in asm it is, for clever humans to specify what they want, vs. just giving a way to tell the compiler about the program logic in a way it can understand and optimize (like the + operator: that doesn't mean you'll get an asm add instruction).
MSVC and ICC tend to take intrinsics even more literally than GCC/clang, not doing constant propagation through them. GCC/clang's choice of how to treat intrinsics is sensible in a lot of ways.
If you have a trivially vectorizable problem like this (no loop-carried dependencies or shuffles), and you want your code to be compilable for future wider vector instructions, use OpenMP #pragma omp simd to tell the compiler you definitely want it vectorized. This kind of problem is what OpenMP is for, if you don't enable full optimization to get compilers to try to auto-vectorize every loop. (Auto-vectorization of every loop happens at gcc -O3 or clang -O2; GCC 12 also enables it at -O2.) OpenMP can also get the compiler to vectorize FP math in ways that it normally wouldn't be allowed to without -ffast-math, e.g. FP reductions like the sum of an array.
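For the trivially vectorizable addition loop above, that might look like this (my own sketch; build with -O2 and -fopenmp-simd, or -fopenmp, plus a suitable -march=):
#include <cstddef>

void test_simd(int* a, int* b, std::size_t n) {
    // Ask for vectorization without committing to a vector width in the source;
    // with -march=core-avx2 the compiler is free to use 256-bit vectors,
    // and with a wider ISA it can use wider ones.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}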

_mm512_mask_i32logather_pd not available for GNU compiler

I have a codebase which contains AVX-512 intrinsics and was built using the Intel compiler. I am trying to build the same thing using the GNU compiler. While compiling the code with the -mavx512f flag using gcc, I get a declaration error only for some AVX-512 intrinsics like _mm512_mask_i32logather_pd.
Standalone Implementation
#include <iostream>
#include <cstdlib>
#include <immintrin.h>

int main() {
    __m512d set = _mm512_undefined_pd();
    __mmask16 msk = 42440;
    __m512i v_index = _mm512_set_epi32(64,66,70,96,98,100,102,104,106,112,114,116,118,120,124,256);
    int scale = 8;
    int count_size = 495*4;
    float *src_ptr = (float*)malloc(count_size*sizeof(float));
    __m512 out_512 = (__m512)_mm512_mask_i32logather_pd(set, msk, v_index, (float*)src_ptr, _MM_SCALE_8);
    return 0;
}
When I compile this standalone example with gcc, I get the error:
error: ‘_mm512_mask_i32logather_pd’ was not declared in this scope; did you mean ‘_mm512_mask_i32gather_pd’?
Compiling the same code with icc and the -xCORE-AVX512 flag works perfectly fine.
Is this because the GNU compiler doesn't support all the AVX-512 intrinsics, even though most of them work perfectly fine with the -mavx512f flag?
Relevant information
gcc version - 11.2.0
ubuntu version - 22.04
icc version 2021.6.0
GCC has intrinsics for all AVX-512 instructions. It doesn't always have every alternate version of every intrinsic that differ only in their C semantics, not the underlying instruction they expose.
I think the only difference from the regular _mm512_mask_i32gather_pd intrinsic (which GCC supports) is that logather takes a __m512i vindex instead of a __m256i, but it uses only the low half, hence the lo in the name. (I looked at them in the intrinsics guide: same pseudocode, just a difference in the C/C++ function signature, and they're listed as intrinsics for the same single instruction.) There doesn't seem to be a higather intrinsic that includes a shuffle; you need to do the extracting yourself.
vgatherdpd gathers 8 double elements to fill a __m512d, using 32-bit indices. The corresponding 8 indices are only a total of 32 bytes wide. That's why the regular more widely-supported intrinsic only takes a __m256i vindex arg.
Your code strangely bothers to initialize 64 bytes (16 indices), not shuffling the high half down. Also you're merge-masking into _mm512_undefined_pd(), which seems a weird example. But pretty obviously this isn't intended to be useful, since you're also loading from uninitialized malloc. You're casting the result to a __m512, I guess using this instruction to gather pairs of float instead of individual doubles? If so, yeah it's more efficient to gather fewer elements, but it's a weird way to make a minimal simple example for an intrinsic you're looking for. I wonder if perhaps you were looking for _mm512_mask_i32gather_ps to gather 16x float elements, merging into a __m512 vector. (The non-_mask_ version gathers all 16 elements, and you don't have to supply a merge target; that's often what you want.)
If you do have your 8 indices in a wider vector for some reason (e.g. as a result of computation and you're going to do 2 gathers after shuffling), you can just cast the vector type:
__m512i vindex = ...;   // the part we want is only the low half
__m512d result = ...;   // something to merge into
result = _mm512_mask_i32gather_pd(result, mask, _mm512_castsi512_si256(vindex),
                                  src_ptr, _MM_SCALE_8);
Your cast to (float*) in the arg list to the intrinsic makes no sense: it actually takes a void* so you can gather 64-bit chunks from anything (and yes it's strict-aliasing and alignment safe, not following C rules). But the normal type would be double*, since this is a _pd gather.
In your example, it would be simpler to just use __m256i vindex = _mm256_setr_epi32(...); (or _mm256_set_epi32, if you like the highest-element-first order for the argument list).
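Putting that together, a self-contained sketch of the more portable version (my own illustration; the 8 indices are taken from your example, but the 8-bit mask value 0xC8 and the calloc'd buffer are made up for demonstration). Build with -mavx512f and run on an AVX-512 machine:
#include <immintrin.h>
#include <cstdlib>

int main() {
    // 8 x 32-bit indices are enough for an 8-element double gather.
    __m256i vindex = _mm256_setr_epi32(256, 124, 120, 118, 116, 114, 112, 106);
    __mmask8 msk = 0xC8;                       // hypothetical merge mask
    double *src = (double*)std::calloc(512, sizeof(double));
    __m512d old = _mm512_setzero_pd();         // kept in lanes where the mask bit is 0
    __m512d got = _mm512_mask_i32gather_pd(old, msk, vindex, src, _MM_SCALE_8);
    (void)got;
    std::free(src);
    return 0;
}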

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: For example, we can take the function: (1+x^2)^3, when x>3 and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I vectorized it manually, only to see that the gain was much smaller than I expected (1.5x). I also replaced the conditional with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics), and I am not sure how they affect the computation. I wrote my code with AVX2, so I was expecting a 4x gain. I presume this plays a role, but I cannot be sure, as the CPU has out-of-order execution. Another problem is that I am unsure whether the performance of the loop I am trying to write is bound by memory bandwidth.
Question
How can I determine whether memory bandwidth or pipeline hazards are limiting this loop? Where can I learn techniques to vectorize it better? Are there good tools for this for Eigen, MSVC, or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        tmp = tmp > 3 ? sqrp1*sqrp1*sqrp1 : 0;
    }
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside a ternary should let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised. Or invalid if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (related: How to force GCC to assume that a floating-point expression is non-negative?)
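For reference, the "cubing outside a ternary" version looks roughly like this (my reconstruction of what is on that Godbolt link, not copied from it):
#include <stddef.h>

void foo2(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        double cubed = sqrp1 * sqrp1 * sqrp1;   // FP math is unconditional in the source
        arr[i] = tmp > 3 ? cubed : 0;           // only the final select is conditional
    }
}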
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.
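If you'd rather stay in Eigen than write the loop by hand, a branchless formulation along these lines should also be vectorizer-friendly (a sketch, untested and un-benchmarked by me):
#include <Eigen/Dense>

Eigen::ArrayXd apply_poly(const Eigen::ArrayXd& x) {
    // Compute (1 + x^2)^3 unconditionally, then select 0 where x <= 3.
    Eigen::ArrayXd cubed = (1.0 + x.square()).cube();
    return (x > 3.0).select(cubed, 0.0);
}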

Reduced efficiency of OpenMP over time

I have a large calculation algorithm where I use OpenMP to help speed it up. It works fine for the first 50 or so iterations (i.e. until y = 50 or so), but then starts to slow down progressively. I also notice that the CPU usage drops from ~100% to ~40% by the end.
The code looks something like this:
#include <iostream>
#include <string>
#include <cstdio>
#include <omp.h>
#include <ipp.h>

int main() {
    std::string filename = "Large_File.file";
    FILE* fid = fopen(filename.c_str(), "rb");
    Ipp32f* vector = ippsMalloc_32f(100000000);
    for (int y = 0; y < 300; y++) {
        fread(vector, sizeof(float), 100000000, fid);
        #pragma omp parallel for
        for (int x = 0; x < 300; x++) {
            // A time-consuming calculation
        }
    }
    ippsFree(vector);
    fclose(fid);
    return 0;
}
First of all, check whether this is exactly the same question, with the same answer, as this Stack Overflow link: Why does moving the buffer pointer slow down fread (C programming language)?
(It's not very far from what Hristo suggested)
Secondly, from the way you stated your question and from "common sense", the slowdown is most likely driven by the fread call (it is the only thing that varies as y increases, judging from your reduced example).
Third, a side comment: since you use IPP, you are likely using a recent (probably Intel) compiler. If so, I would suggest more modern ways of allocating memory in an aligned manner: either _aligned_malloc() or #include <aligned_new>.
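For what it's worth, a minimal sketch of the aligned-allocation suggestion in portable C++17 (my own illustration, on platforms that provide std::aligned_alloc; IPP's ippsMalloc_32f already returns aligned memory, so this only matters if you move away from it):
#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
    const std::size_t count = 100000000;
    // 64-byte alignment; std::aligned_alloc requires the size to be a multiple of the alignment.
    float* vec = static_cast<float*>(std::aligned_alloc(64, count * sizeof(float)));
    // ... read into vec and process it ...
    std::free(vec);
    return 0;
}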
(I assume you expected to hear more about OpenMP-specific slowdowns, but it's pretty unlikely that threading is related in your case. You could verify this by disabling OpenMP and comparing the y=50, y=100, y=150, ... runs, although that won't prove much if you really are dealing with sophisticated NUMA/threading composability issues, which again I don't believe is the case.)

Is it possible to bring GCC into an infinite loop?

Is it possible to bring GCC into an infinite loop by inputting strange source code? And if yes, how? Maybe one could do something with Template Metaprogramming?
Yes.
Almost every computer program has loop termination problems. I'm thinking that GCC, however, would run out of RAM before an infinite loop ever becomes obvious. There aren't many "free" operations in its design.
The parser & preprocessor wouldn't create problems. I'm willing to bet that you could target the optimizer, which would likely have more implementation faults. It would be less about the language and more about exploiting a flaw you could discover from the source code. i.e. the exploit would be non-obvious.
UPDATE
In this particular case, my theory seems correct. The compiler keeps allocating RAM and the optimizer does seem to be vulnerable. The answer is yes. Yes you can.
Bugs are particularly transient; for example, the one in Pestilence's answer was found in GCC 4.4.0 and fixed in 4.4.1. For a list of current ways to bring GCC to an infinite loop, check their Bugzilla.
EDIT: I just found a new way, which also crashes Comeau. This is a more satisfying answer, for now. Of course, it should also be fixed soon.
template< int n >
struct a {
    a< n+1 > operator->() { return a< n+1 >(); }
};

int main() {
    a<0>()->x;
}
Since C++ template metaprogramming is in fact Turing-complete, you can make a never-ending compilation.
For example:
template<typename T>
struct Loop {
    typedef typename Loop<Loop<T> >::Temp Temp;
};

int main(int, char**) {
    Loop<int> n;
    return 0;
}
However, as in the answer before me, gcc has a flag (-ftemplate-depth) to stop this from continuing endlessly, much like a stack overflow in infinite recursion.
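To illustrate (saving the example above as loop.cpp is my own naming; the exact default depth limit varies by GCC version):
g++ -ftemplate-depth=50 loop.cpp
fails quickly with an error about exceeding the maximum template instantiation depth, rather than recursing much deeper.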
Bentley writes in his book "Programming Pearls" that the following code resulted in an infinite loop during optimized compilation:
void traverse(node* p) {
    traverse(p->left);
    traverse(p->right);
}
He says "the optimizer tried to convert the tail recursion into a loop, and died when it couldn't find a test to terminate the loop." (p. 139) He doesn't report the exact compiler version where that happened. I assume newer compilers detect the case.
It may be possible. But most compilers (and most standardised languages) have limits on things like recursion depths in templates or include files, at which point the compiler should bail out with a diagnostic. Compilers that don't do this are not normally popular with users.
Don't know about gcc, but old pcc used to go into an infinite loop compiling some kinds of infinite loops (the ones that compiled down to _x: jmp _x).
I think you could do it with #include.
Just #include "file1.c" in file2.c and #include "file2.c" in file1.c.
(In practice, this suggestion makes the compiler loop a lot and then fail, rather than loop infinitely.)
