ICC, GCC and OpenMP - gcc

I am running a problem that is parallelized with OpenMP. It runs a given number of iterations of the same piece of code, which processes a volume of data. OpenMP is applied at that level, with each thread processing a subvolume. Every iteration should have the same workload, and so should every subvolume.
When compiled with ICC, every iteration takes the same amount of time, as expected. Here is the weird part: when compiled with GCC, the time per iteration starts to increase, reaches a maximum and then decreases again until it stabilises at a given value. Compiled without OpenMP, the program behaves the same with ICC and GCC.
Has anyone observed this behaviour with OpenMP in these compilers?
[EDIT 1]: guided and static scheduling policies have been tested.
[EDIT 2]: The code looks somewhat like this:
#pragma omp parallel for schedule(static) private(i,j,k)
for(i = 0; i < N; i++)
for(j = 0; j < N; j++)
for(k = 0; k < N; k++){
a[ k+j*N+i*NN] = 0.f;
b[ k+j*N+i*NN] = 0.f;
c[ k+j*N+i*NN] = 0.f;
d[ k+j*N+i*NN] = 0.f;
}
for( t = 0; t < T; t+=dt){
/* ... change some discrete values in a,b,c .... */
/* and propagate changes */
#pragma omp parallel for schedule(static) private(i,j,k)
for(i = 0; i < N; i++)
for(j = 0; j < N; j++)
for(k = 0; k < N; k++){
d[ k+j*N+i*NN ] = COMP( a,b,c,k+j*N+i*NN );
}
}
Here COMP applies some kind of linear operation to the values of a, b, c at position k+j*N+i*NN (and some of their neighbours). This is the code that, compiled with GCC and ICC, showed the behaviour I described. I also found that if I change the initialisation of a, b, c, d to some value other than 0.0f (e.g. 0.5f), the increase in time per time step does not occur.
[EDIT 3]: It seems it is not GOMP's fault. The same happens with OpenMP disabled. Once again, with ICC (with or without OpenMP) it doesn't occur at all. Is there any way I can close this thread?

Maybe COMP is operating on denormal (subnormal) numbers, which take a slow path (microcode or software assist) instead of the fast hardware path.
Working with denormals can change the run time compared with flush-to-zero mode (where every denormal is rounded to zero). Code from a compiler that handles denormals faithfully has more work to do, and the amount of work can vary between iterations.
The Intel compiler by default disables denormal handling and sets Flush-to-zero and Denormals-are-zero at -O1 and above (-O1, -O2, -O3); at -O0 the default is -no-ftz.
To turn denormals on, use the -no-ftz option of the Intel compiler (docs1) (docs2), or maybe -fp-model precise.
In GCC, denormals-are-zero is turned on only by the -ffast-math option, which is not set by any of -O1, -O2, -O3: (grep a -ffast-math). -ffast-math includes ignoring denormals (bug36821, comment#1).
So, if you have a lot of denormals in COMP, ICC will treat them as zero, while GCC will do a lot of slow denormal handling.
It is also possible that denormals are not the cause, and that some other difference in floating-point handling is.
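To check whether denormals are actually involved, one option is to count them in the arrays after a time step. The sketch below is an assumption about how that could be done, not part of the original program: count_subnormals is a hypothetical helper built on the standard fpclassify macro, and enable_flush_to_zero is a hypothetical wrapper around the SSE macro _MM_SET_FLUSH_ZERO_MODE from xmmintrin.h, so it only has an effect on x86 builds.

```c
#include <math.h>       /* fpclassify, FP_SUBNORMAL */
#include <stddef.h>     /* size_t */

#if defined(__SSE__)
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#endif

/* Hypothetical helper: count how many entries of v are denormal (subnormal). */
size_t count_subnormals(const float *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (fpclassify(v[i]) == FP_SUBNORMAL)
            count++;
    return count;
}

/* Hypothetical helper: ask the SSE unit of the calling thread to flush
   denormal results to zero. Does nothing where SSE is unavailable. */
void enable_flush_to_zero(void)
{
#if defined(__SSE__)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#endif
}
```

If count_subnormals(d, N*N*N) grows and shrinks along with the time per step, denormals are a likely cause. The flush-to-zero setting is per thread, so with OpenMP enable_flush_to_zero would have to be called inside the parallel region by every thread.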

Related

while loop getting stuck - Openmp

I was trying to implement some piece of parallel code and tried to synchronize the threads using an array of flags as shown below
// flags array set to zero initially
#pragma omp parallel for num_threads (n_threads) schedule(static, 1)
for(int i = 0; i < n; i ++){
for(int j = 0; j < i; j++) {
while(!flag[j]);
y[i] -= L[i][j]*y[j];
}
y[i] /= L[i][i];
flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it using gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8), and all of them lead to the loop getting stuck. By putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop is still stuck, or maybe there is some other problem with the code that I am not aware of.
I also figured out that if I try to do something inside the while block like printf("Hello\n") the problem goes away. I think there is some problem with the memory consistency across threads but I do not know how to resolve this. Any help would be appreciated.
Edit: The single threaded code I am trying to parallelise is
for(int i=0; i<n; i++){
for(int j=0; j < i; j++){
y[i]-=L[i][j]*y[j];
}
y[i]/=L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency. The result of your code depends on the order of execution. Try reversing the i loop without OpenMP: you will get a different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop, but the workload inside this loop is very small, so the OpenMP overhead will be significantly bigger than the speed gained by parallelization.
EDIT: For your updated code I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure that the inner loop is properly vectorized, so I suggest the following:
for(int i = 0; i < n; i ++){
double sum_yi=y[i];
#pragma GCC ivdep
for(int j = 0; j < i; j++) {
sum_yi -= L[i][j]*y[j];
}
y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to tell the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable). Because the inner loop is a floating-point reduction, GCC may also need -ffast-math (or -fassociative-math) before it vectorizes it.
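For completeness, the data race in the original flag-based loop comes from reading and writing flag[] without any synchronisation, so the compiler is free to hoist the load out of the while loop. A minimal sketch of one way to remove the race, assuming OpenMP 4.0 or later for the seq_cst clause, is shown below. The function name and the flat row-major layout of L are assumptions made for illustration, and the loop-carried dependency still serialises most of the work, so this is not expected to be fast.

```c
#include <stddef.h>

/* Hypothetical helper: solves L*y = b in place (y holds b on entry).
   L is an n x n lower-triangular matrix stored row-major in a flat array;
   flag must point to n ints used as "row finished" markers. */
void forward_subst_flags(int n, const double *L, double *y, int *flag)
{
    for (int i = 0; i < n; i++)
        flag[i] = 0;

    /* schedule(static, 1) hands out rows round-robin, so the lowest
       unfinished row is always owned by a thread that can run it. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; i++) {
        double sum = y[i];
        for (int j = 0; j < i; j++) {
            int ready = 0;
            while (!ready) {
                /* atomic read forces a fresh load on every spin */
                #pragma omp atomic read seq_cst
                ready = flag[j];
            }
            sum -= L[(size_t)i * n + j] * y[j];
        }
        y[i] = sum / L[(size_t)i * n + i];
        /* publish y[i] before announcing that row i is finished */
        #pragma omp atomic write seq_cst
        flag[i] = 1;
    }
}
```

Compiled without -fopenmp the pragmas are ignored and the function simply runs the serial forward substitution.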

Gcc autovectorization weird behaviour in matrix multiply when arrays are function parameters

I'm benchmarking different matrix multiply forms with different optimization levels (for teaching purposes) and I detected a strange behavior in gcc autovectorization. It fails to vectorize when arrays are parameters (see mxmp) but is able to vectorize when arrays are global variables (see mxmg)
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
but behaviour was the same with older gcc versions
Compiling options:
gcc -O3 -mavx2 -mfma
#define N 1024
float A[N][N], B[N][N], C[N][N];
void mxmp(float A[N][N], float B[N][N], float C[N][N]) {
int i,j,k;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
void mxmg() {
int i,j,k;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
int main(void){
mxmg();
mxmp(A, B, C);
}
I expected the compiler to do the same in both functions; however, mxmp requires about 10 times the execution time of mxmg. Exploring the assembly code, it turns out that gcc is able to autovectorize mxmg (where the arrays are global variables) but fails to vectorize mxmp (where the arrays are parameters).
I tried the same with the kij form, and gcc is able to vectorize both functions.
I need help to discover why gcc behaves this way, and how to help gcc (pragmas, compile options, attributes, ...) to properly vectorize the mxmp function.
Thanks
When the arrays are global, the compiler can easily see that they are disjoint memory regions. When they are function parameters, you could call mxmp(A,A,A), so it has to assume that writing to C may modify A or B, which could affect later iterations and complicates vectorization. Of course the compiler could inline or do other things to know it in your particular case...
You can explicitly specify the lack of aliasing with restrict:
void mxmp(float A[restrict N][N], float B[restrict N][N], float C[restrict N][N]) {
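As a complete illustration of the same idea, here is a sketch that uses flat row-major arrays instead of the fixed-size 2D parameters. The function name mxm_restrict and the runtime dimension n are assumptions made for the example; the restrict qualifiers carry the promise that A, B and C do not overlap. Whether the vectoriser then succeeds still depends on the compiler version and flags.

```c
#include <stddef.h>

/* Illustrative variant of mxmp: C += A * B for n x n row-major matrices.
   restrict promises the compiler that the three arrays do not alias. */
void mxm_restrict(size_t n,
                  const float *restrict A,
                  const float *restrict B,
                  float *restrict C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] = C[i * n + j] + A[i * n + k] * B[k * n + j];
}
```

Calling it as mxm_restrict(n, A, A, A) would break the promise and is undefined behaviour, which is exactly the case the compiler has to guard against without restrict.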

Simple openmp call for loop not working

I am writing some software that would definitely benefit from OpenMP, and I am trying to integrate it. I am new to OpenMP, and while testing some very basic code (see below) I noticed that the execution time is much longer with OpenMP activated (the #pragma line). Any insight is much appreciated.
int main()
{
int number=200;
int max = 2000000;
for(int t=1; t<max; t++)
{
double fac = 0.0;
#pragma omp parallel for reduction(+:fac)
for(int n=2; n<=number; n++)
fac += 1;
}
return 0;
}
As currently written, the code encounters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it 2000000 times. You don't actually tell us what the run times are, but I can readily believe that this makes them much longer than for the serial version. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant, and returns some value(s) to the program outside the parallel region. Absent these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.
This is an improved code that actually works as intended. It basically wraps the outer loop, rather than the inner one. When compiled without openmp support it takes 1.49s, with openmp 0.48s.
int main()
{
int number=200;
int max = 2000000;
#pragma omp parallel for
for(int t=1; t<max; t++)
{
double fac = 0.0;
for(int n=2; n<=number; n++)
fac += 1;
}
return 0;
}
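Since fac is never used after the loop, an optimising compiler may remove the work entirely, as warned above. A hedged sketch of a variant that returns a value, so the loop cannot be discarded, is given below. The function name count_work and the accumulator total are assumptions introduced for the example.

```c
/* Hypothetical variant: same loop nest as above, but each outer iteration
   adds its result to `total`, which is returned to the caller. */
double count_work(int max, int number)
{
    double total = 0.0;
    /* one parallel region around the outer loop; each thread keeps a
       private partial sum that is combined when the loop ends */
    #pragma omp parallel for reduction(+:total)
    for (int t = 1; t < max; t++) {
        double fac = 0.0;
        for (int n = 2; n <= number; n++)
            fac += 1;
        total += fac;
    }
    return total;
}
```

Printing the value returned by count_work(2000000, 200) from main, with omp_get_wtime calls around it, gives a timing that reflects work the compiler actually had to keep.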

How can I get my CPU's branch target buffer(BTB) size?

Knowing it is useful when running a routine like this with LOOPS > BTB_SIZE. For example, changing it
from
int n = 0;
for (int i = 0; i < LOOPS; i++)
n++;
to
int n = 0;
int loops = LOOPS / 2;
for(int i = 0; i < loops; i++)
n += 2;
can reduce branch misses.
BTB ref:http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html but it doesn't tell how to get the BTB size.
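For reference, a sketch of unrolling by two that still ends with n equal to the loop count, including when the count is odd. The function name count_unrolled is hypothetical.

```c
/* Hypothetical helper: counts up to `loops` with the body unrolled by two.
   The trailing `if` handles the leftover iteration when loops is odd. */
int count_unrolled(int loops)
{
    int n = 0;
    int i = 0;
    for (; i + 1 < loops; i += 2)
        n += 2;
    if (i < loops)
        n += 1;
    return n;
}
```

The loop branch runs about half as many times, but the code still contains only one loop branch instruction.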
Any modern compiler worth its salt should optimise this code to int n = LOOPS;, and in a more complex example the compiler will also take care of such optimisations; see LLVM's auto-vectorisation, for instance, which handles many kinds of loop unrolling. Rather than trying to hand-optimise your code, find appropriate compiler flags to get the compiler to do all the hard work.
From the BTB's point of view, both versions are the same. In both versions (if compiled unoptimized) there is only one conditional jump (originating from the i<LOOPS test), so there is only one jump target in the code, and thus only one BTB entry is used. You can see the resulting assembly code using Matt Godbolt's Compiler Explorer.
There would be difference between
for(int i=0;i<n;i++){
if(i%2==0)
do_something();
}
and
for(int i=0;i<n;i++){
if(i%2==0)
do_something();
if(i%3==0)
do_something_different();
}
The first version would need 2 BTB entries (one for the for and one for the if), the second would need 3 (one for the for and one for each of the two ifs).
However, as Matt Godbolt found out, there are 4096 BTB entries, so I would not worry too much about them.

OpenMP in Ubuntu: parallel program works on double core processor in two times slower than single-threaded. Why?

I get the code from wikipedia:
#include <stdio.h>
#include <omp.h>
#define N 100
int main(int argc, char *argv[])
{
float a[N], b[N], c[N];
int i;
omp_set_dynamic(0);
omp_set_num_threads(10);
for (i = 0; i < N; i++)
{
a[i] = i * 1.0;
b[i] = i * 2.0;
}
#pragma omp parallel shared(a, b, c) private(i)
{
#pragma omp for
for (i = 0; i < N; i++)
c[i] = a[i] + b[i];
}
printf ("%f\n", c[10]);
return 0;
}
I tried to compile and run it on my Ubuntu 11.04 with gcc 4.5 (my configuration: Intel C2D T7500M 2.2GHz, 2048 MB RAM) and this program ran two times slower than the single-threaded version. Why?
Very simple answer: Increase N. And set the number of threads equal to the number processors you have.
For your machine, 100 is a very low number. Try some orders of magnitudes higher.
Another question is: How are you measuring the computation time? Usually one takes the program time to get comparable results.
I suppose the compiler optimized the for loop in the non-smp case (using SSE instructions, e.g.) and it can't in the OMP variant.
Use gcc -S (or objdump -S) to view the assembly for the different variants.
You might want to watch out with the shared variables anyway, because they need to be synchronized, making things very slow. If you can use 'smart' chunks (look at the schedule clause) you might reduce the contention, but again:
verify the emitted code
profile
don't underestimate the efficiency of singlethreaded code (because of cache locality and lack of context switches)
set the number of threads to the number of CPUs (let openMP decide it for you!); unless your thread-team has a master thread with dedicated tasks, in which case there might be value in allocating ONE extra thread
Of all the cases where I tried to apply OMP for parallelization, roughly 70% ended up slower. The cases where there is a definite speedup are those with
coarse-grained parallellism (your sample is on the fine-grained end of the spectrum)
no shared data
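One reading of "smart" chunks is to make each chunk a whole number of cache lines, so that two threads rarely write to the same line of c. The sketch below assumes 64-byte cache lines (16 floats) and a c array that starts on a cache-line boundary; both are assumptions, as is the function name add_arrays.

```c
/* Hypothetical helper: c[i] = a[i] + b[i] with chunks of 16 floats,
   i.e. one 64-byte cache line per chunk (assumed line size). */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for schedule(static, 16)
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

As the list above says, it is worth profiling before and after, since for small n the cost of starting the parallel region dominates whatever the chunk size is.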
The issue you are facing is false memory sharing. Each thread should have its own private c[i].
Try this: #pragma omp parallel shared(a, b) private(i, c)
Run the code below and see the difference.
1.) OpenMP has an overhead so the runtime has to be more than the overhead to see a benefit.
2.) Don't set the number of threads yourself. In general I use the default threads. However, if your processor has hyper-threading you might get a bit better performance by setting the number of threads equal to the number of cores. With hyper threading the default number of threads will be twice the number of cores. For example on my machine I have four cores and the default number of threads is eight. By setting it to four in some situations I get better results and in other cases I get worse results.
3.) There is some false sharing in c but as long as N is large enough (which it needs to be to overcome the overhead) the false sharing will not cause much of a problem. You can play with the chunk size but I don't think it will be helpful.
4.) Cache issues. You have at least four levels of memory (the values are for my system): L1 (32 KB), L2 (256 KB), L3 (12 MB), and main memory (>> 12 MB). The benefits of parallelism are going to diminish as you move to the higher levels. However, in the example below I set N to 100 million floats, which is 400 million bytes or about 381 MB, and it is still significantly faster using multiple threads. Try adjusting N and see what happens. For example, try setting N to a cache level's size divided by 4 (one float is 4 bytes); arrays a and b also need to be in the cache, so you might need to set N to the cache size divided by 12. However, if N is too small you fight with the OpenMP overhead (which is what the code in your question does).
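#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 100000000
int main(int argc, char *argv[]) {
/* heap allocation: three arrays of N floats are far too large for the stack */
float *a = malloc(N * sizeof *a);
float *b = malloc(N * sizeof *b);
float *c = malloc(N * sizeof *c);
if (!a || !b || !c) return 1;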
int i;
for (i = 0; i < N; i++) {
a[i] = i * 1.0;
b[i] = i * 2.0;
}
double dtime;
dtime = omp_get_wtime();
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
dtime = omp_get_wtime();
#pragma omp parallel for private(i)
for (i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
dtime = omp_get_wtime() - dtime;
printf ("time %f, %f\n", dtime, c[10]);
return 0;
}
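The sizing rule from point 4 (three float arrays sharing one cache level) can be written as a small helper. The name max_n_for_cache is hypothetical, and the cache sizes passed in are whatever your own system reports.

```c
#include <stddef.h>

/* Hypothetical helper: largest N such that a, b and c (N floats each)
   fit together in a cache of cache_bytes bytes. */
size_t max_n_for_cache(size_t cache_bytes)
{
    return cache_bytes / (3 * sizeof(float));
}
```

With 4-byte floats, the 32 KB L1 quoted above gives N = 2730 and the 256 KB L2 gives N = 21845, both of which are probably still too small to overcome the OpenMP overhead.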
