OpenMP parallelization in Linux using Makefile - openmp

I am new to OpenMP parallelization and would be thankful if someone could provide advice.
My goal is to enable OpenMP parallelization and reduce the simulation time by increasing the number of threads.
I am using the Intel compiler on Linux and enabling OpenMP parallelization through a makefile.
The makefile is as follows.
# for ifort
ifeq (${FC},ifort)
MKLROOT = /opt/intel/mkl/lib/intel64
FFLAGS += -I${MKLROOT}/include/intel64 -I${MKLROOT}/include
LDFLAGS += -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -qopenmp -liomp5 -lpthread -lm -ldl
endif
When I run this, increasing the number of threads has no effect on the run time. I was wondering if there is a problem in the makefile.
Can someone please guide me? Thank you very much.
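For reference, here is a minimal OpenMP test program (a sketch of mine, not part of the original question) that can help check whether the threads are actually being used. Note that -qopenmp normally has to be passed when compiling the Fortran sources as well, not only at link time, for the !$omp directives to be honoured; and since the makefile links the threaded MKL (-lmkl_intel_thread), MKL's internal threading is also controlled by OMP_NUM_THREADS (or MKL_NUM_THREADS).
program omp_test
  use omp_lib
  implicit none
  integer :: i, n
  double precision :: s
  n = 100000000
  s = 0.0d0
  ! this loop should run faster as OMP_NUM_THREADS is increased
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + 1.0d0/dble(i)
  end do
  !$omp end parallel do
  write(*,*) 'sum =', s, '  max threads =', omp_get_max_threads()
end program omp_test
Building it with something like ifort -qopenmp omp_test.f90 -o omp_test and running with OMP_NUM_THREADS set to 1, 2, 4 should show the loop time decreasing if OpenMP is in effect.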

Related

MPI code much slower when compiling with -fopenmp flag (for MPI with multi-thread)

I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
rm -f *.o*
The second makefile looks very similar; I just added the -fopenmp flag:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
rm -f *.o*
The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code but compiled with these two different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?
edit 1: I am not running multi-threaded tasks. In fact, the code does not have any OpenMP directives; it is just the pure MPI code but compiled with a different makefile. Nevertheless, I did try running with the following commands (see below) and it didn't help.
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S
edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag could be inhibiting compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?) I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.
edit 3: When I run without -fopenmp (first makefile) it takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but including the flag -fopenmp it takes about 11 seconds with either -O3 or -O0.
It turned out that the problem was really task affinity, as suggested by Vladimir F and Gilles Gouaillardet (thank you very much!).
First I realized I was running MPI with OpenMPI version 1.6.4 and not MVAPICH2, so the command export MV2_ENABLE_AFFINITY=0 has no real meaning here. Second, I was (presumably) taking care of the affinity of different OpenMP threads by setting
export OMP_PROC_BIND=true
export OMP_PLACES=cores
but I was not setting the correct bindings for the MPI processes, as I was incorrectly launching the application as
mpirun -np 2 ./ParP2S
and it seems that, with OpenMPI version 1.6.4, a more appropriate way to do it is
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S
The options -bind-to-core -bycore -cpus-per-proc 2 assure 2 cores for my application (see https://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php and also Gilles Gouaillardet's comments on Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core). Without them, both MPI processes were going to a single core, which was the reason for the poor efficiency of the code when the flag -fopenmp was used in the Makefile.
Apparently, when running pure MPI code compiled without the -fopenmp flag, different MPI processes go automatically to different cores, but with -fopenmp one needs to specify the bindings manually as described above.
As a matter of completeness, I should mention that there is no standard for setting the correct task affinity, so my solution will not work on e.g. MVAPICH2 or (possibly) different versions of OpenMPI. Furthermore, running nproc MPI processes with nthreads threads each on ncores cores would require e.g.
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=nthreads
mpirun -np nproc -bind-to-core -bycore -cpus-per-proc ncores ./hParP2S
where ncores=nproc*nthreads.
ps: my code has an MPI_all_to_all. The situation where more than one MPI process sits on a single core (no hyperthreading) while calling this subroutine should be the reason why the code was about 100 times slower.
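As a side note, a small hybrid test program (a hypothetical sketch, not the original ParP2S code) can help check how many ranks and threads are actually created and on which node each one lands; core-level placement still has to be inspected with external tools, since MPI_Get_processor_name only reports the host name.
program check_placement
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, nranks, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: pname
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  call MPI_Get_processor_name(pname, namelen, ierr)
  ! every thread of every rank reports where it runs
  !$omp parallel
  write(*,'(a,i0,a,i0,a,i0,a,a)') 'rank ', rank, ' of ', nranks, &
       ', thread ', omp_get_thread_num(), ' on host ', trim(pname)
  !$omp end parallel
  call MPI_Finalize(ierr)
end program check_placement
Compiled with mpif90 -fopenmp and launched with and without the binding options above, it makes it obvious whether the intended combination of processes and threads is being started.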

Compiling in gfortran with makefile

I have received a bunch of .f95 files to be compiled. The only info included regarding their compilation is the order in which they have to be compiled and that the files are in free form. Besides that, there is a Makefile, but it is a Makefile made for the Intel Fortran compiler. I know nothing about Fortran and just need to make use of the code. I do not have access to the Intel Fortran compiler, and gfortran on macOS is my only available choice. I compiled similar code previously in a similar way and it worked fine. Nevertheless, I get multiple errors and nothing happens.
As I said, the Makefile is not complex and is split into three main sections. How could I "translate" this to gfortran syntax and compile the code? Is there an equivalence of options between the two compilers? I enclose an abridged version of the Makefile.
Mine
% ifort -o BIN1.exe -O3 -diag-disable 8291 file1.f90 file2.f90 ....
% ifort -g -check bounds -o BIN1n.exe -O3 file1.f90 file2.f90 ....
% ifort -g -debug full -traceback -check bounds -check uninit -check pointers -check output_conversion -check format -warn alignments -warn truncated_source -warn usage -ftrapuv -fp-stack-check -fpe0 -fpconstant -vec_report0 -diag-disable 8291 -warn unused -o BIN.exe -O3 file1.f90 file2.f90 ....
You need to convert the ifort flags to gfortran flags. To the best of my knowledge this can only be done by reading the documentation of ifort and gfortran. I'm no expert but:
maybe -fpconstant can be replaced by -fdefault-real-8 (if I understand correctly this gfortran flag has the effect of the ifort flags -r8 and -fpconstant),
maybe -fpe0 can be replaced by using -ffpe-trap=XXX.
PS: You can find some equivalences on the page Compiling with gfortran instead of ifort.
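To illustrate what the constant-related flags in the first suggestion are about, here is a small sketch (my understanding of -fpconstant and -fdefault-real-8, so please verify against the compiler manuals): without any promotion flag, a single-precision literal loses accuracy when it is stored in a double precision variable, which is what those flags are meant to prevent.
program const_demo
  implicit none
  double precision :: x, y
  x = 0.1        ! default (single precision) literal stored in a double
  y = 0.1d0      ! explicit double precision literal
  print *, 'x =', x
  print *, 'y =', y
end program const_demo
Comparing the two printed values with and without the flag shows whether the promotion is in effect.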

How can a segfault happen at runtime only because of linking unused modules?

I get a segmentation fault from a memory allocation statement just because I have linked some unrelated procedures to the binary.
I have a very simple Fortran program:
program whatsoever
!USE payload_modules
double precision,allocatable:: Vmat(:,:,:)
allocate(Vmat(2,2,2))
Vmat=1
write(*,*) Vmat
deallocate (Vmat)
! some more lines of code using procedures from payload_module
end program whatsoever
Compiling this using gfortran whatsoever.f95 -o whatsoever leads to a program with the expected behaviour. Of course, this program is not meant to print 1.000 eight times but to call the payload modules, which are still hidden in the comments. However, if I compile and link the program with the modules by issuing
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload_module1.f90 payload_module2.f90 whatsoever.f95
gcc -g -nostdlib -v -Wl,--verbose -std=gnu99 -shared -Wl,-Bsymbolic-functions \
-Wl,-z,relro -o whatsoever whatsoever.o payload_module1.o payload_module2.o
the program whatsoever doesn't run any more. I get a segmentation fault at the allocate statement. I have not yet uncommented the lines related to the modules (however, uncommenting them leads to the same behaviour)!
I know that the payload modules' code is not buggy, because I ran it before from R and wrapped that working code into an f90 module. There are no name collisions; nothing in the modules is called Vmat. There is only one other call to allocate in the modules, and it never caused any trouble. There is still plenty of memory left. gdb didn't give me any hints except a memory address.
How can linking routines that are actually not called crash a program?
Compiling your code with
gfortran whatsoever.f95 -o whatsoever
works because you link against the system libraries; everything is in place. This would correspond to
gfortran whatsoever.f95 payload_module1.f90 payload_module2.f90 -o whatsoever
which would also work. The commands you used instead omit the system libraries, and the code fails the first time you call a function from them (the allocation). You don't see that you are missing the libraries, because you create a shared object (which is typically linked against the libraries later on).
You chose to separate compiling the objects from linking them into an executable. Doing this for a Fortran program using gcc, you need to specify the Fortran libraries yourself, so there's a -lgfortran missing.
I'm not sure about that particular choice of compile options... -shared is usually used for libraries; are you sure you want a shared binary (whatever that is)?
With -nostdlib you tell the compiler not to link against the system libraries. You would then need to specify those libraries yourself (which you don't).
For the main program test.F90 and a module payload.F90, I run
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload.F90 test.F90
gcc -g -v -Wl,--verbose -std=gnu99 -Wl,-Bsymbolic-functions \
-Wl,-z,relro -lgfortran -o whatsoever test.o payload.o
This compiles and executes correctly.
It might be easier to pass these options to gfortran directly:
gfortran -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none -Wl,-Bsymbolic-functions -Wl,-z,relro \
payload.F90 test.F90 -o whatsoever
The result is the same.

Optimization using gcc flags

I am new to C and I am trying to optimize my code for matrix multiplication. I am trying to use the gcc optimization flags but cannot figure out the syntax. Can anyone help?
gcc -O2 -funroll-loops -dgemm-blocked.c
(optimization level / flag name / file name)
What am I missing?

optimization and debugging options in Makefile

I wonder where to put the optimization and debugging options in a Makefile: the linking stage or the compiling stage? I am reading a Makefile:
ifeq ($(STATIC),yes)
LDFLAGS=-static -lm -ljpeg -lpng -lz
else
LDFLAGS=-lm -ljpeg -lpng
endif
ifeq ($(DEBUG),yes)
OPTIMIZE_FLAG = -ggdb3 -DDEBUG
else
OPTIMIZE_FLAG = -ggdb3 -O3
endif
ifeq ($(PROFILE),yes)
PROFILE_FLAG = -pg
endif
CXXFLAGS = -Wall $(OPTIMIZE_FLAG) $(PROFILE_FLAG) $(CXXGLPK)
test: test.o rgb_image.o
$(CXX) $(CXXFLAGS) -o $@ $^ $(LDFLAGS)
Makefile.depend: *.h *.cc Makefile
$(CC) -M *.cc > Makefile.depend
clean:
\rm -f absurdity *.o Makefile.depend TAGS
-include Makefile.depend
What surprises me is that CXXFLAGS is used in linking. I know it is also used in the implicit rule for compiling to generate .o files, but is it necessary to use it again for linking? Specifically, where should I put the optimization and debugging options: the linking stage or the compiling stage?
Short answer:
optimization: needed at compile time
debug flag: needed at compile time
debugging symbols: needed at both compile and link time
Take note that the linker decides what bits of each object file and library need to be included in the final executable. It could throw out the debugging symbols (I don't know what the default behavior is), so you need to tell it not to.
Further, the linker will silently ignore options which do not apply to it.
In response to the comments:
The above are very general claims based on knowing what happens at each stage of compilation, so no reference.
A few more details:
optimization: takes two major forms: peephole optimization can occur very late, because it works on a few assembly instructions at a time (I presume that in the GNU tool chain the assembler is responsible for this step), but the big gains are in structural optimizations that are generally accomplished by re-writing the Abstract Syntax Tree (AST) which is only possible during compilation.
debug flag: In your example this is a preprocessor directive, and only affects the first part of the compilation process.
debugging symbols: Look up the ELF file format (for instance); you'll see that various bits of code and data are organized into different blocks. Debugging symbols are stored in the same file alongside the code they relate to, but are necessarily kept separate from the actual code. As such, any program that manipulates these files could simply drop them. Therefore both the compiler and the linker need to know whether you want them or not.
