MPI code much slower when compiled with the -fopenmp flag (for MPI with multi-threading) - makefile

I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
	rm -f *.o*
The second makefile looks very similar; I just added the -fopenmp flag:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
	rm -f *.o*
The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code but compiled with these two different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?
edit 1: I am not running multi-threaded tasks. In fact, the code does not have any OpenMP directives; it is just the pure MPI code compiled with a different makefile. Nevertheless, I did try running after setting the following environment variables (see below) and it didn't help.
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S
edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag could be inhibiting compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?) I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.
edit 3: When I run without -fopenmp (first makefile) it takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but with the -fopenmp flag it takes about 11 seconds, whether I use -O3 or -O0.

It turned out that the problem was really task affinity, as suggested by Vladimir F and Gilles Gouaillardet (thank you very much!).
First I realized I was running MPI with OpenMPI version 1.6.4 and not MVAPICH2, so the command export MV2_ENABLE_AFFINITY=0 has no real meaning here. Second, I was (presumably) taking care of the affinity of different OpenMP threads by setting
export OMP_PROC_BIND=true
export OMP_PLACES=cores
but I was not setting the correct bindings for the MPI processes, as I was incorrectly launching the application as
mpirun -np 2 ./ParP2S
and it seems that, with OpenMPI version 1.6.4, a more appropriate way to do it is
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S
The options -bind-to-core -bycore -cpus-per-proc 2 assure 2 cores for my application (see https://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php and also Gilles Gouaillardet's comments on Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core). Without them, both MPI processes were going to one single core, which was the reason for the poor efficiency of the code when the flag -fopenmp was used in the Makefile.
Apparently, when running pure MPI code compiled without the -fopenmp flag, different MPI processes go automatically to different cores, but with -fopenmp one needs to specify the bindings manually as described above.
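A quick way to see where the ranks actually end up (Open MPI's mpirun supports a --report-bindings option; I assume here the same launcher and binary names as above) is
mpirun -np 2 --report-bindings -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S
which prints the core(s) each MPI process is bound to before the program starts.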
As a matter of completeness, I should mention that there is no standard for setting the correct task affinity, so my solution will not work on, e.g., MVAPICH2 or (possibly) different versions of OpenMPI. Furthermore, running nproc MPI processes with nthreads threads each on ncores cores would require, e.g.,
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=nthreads
mpirun -np nproc -bind-to-core -bycore -cpus-per-proc ncores ./hParP2S
where ncores=nproc*nthreads.
PS: my code has an MPI_Alltoall call. Having more than one MPI process on a single core (without hyperthreading) while calling this subroutine is probably the reason why the code was about 100 times slower.

Related

Compiling in gfortran with makefile

I have received a bunch of .f95 files to be compiled. The only info included regarding their compilation is the order in which the files have to be compiled and that they are in free form. Besides that, there is a Makefile, but it is a Makefile made for the Intel Fortran Compiler. I know nothing about Fortran and just need to make use of the code. I do not have access to the Intel Fortran Compiler, and gfortran on macOS is my only available choice. I compiled similar code previously in a similar way and it worked fine. Nevertheless, I get multiple errors and nothing happens.
As I said, the Makefile is not complex and is split into three main sections. How could I "translate" this to gfortran syntax and compile the code? Is there an equivalence of options between the two? I enclose an abridged version of the Makefile.
Mine
% ifort -o BIN1.exe -O3 -diag-disable 8291 file1.f90 file2.f90 ....
% ifort -g -check bounds -o BIN1n.exe -O3 file1.f90 file2.f90 ....
% ifort -g -debug full -traceback -check bounds -check uninit -check pointers -check output_conversion -check format -warn alignments -warn truncated_source -warn usage -ftrapuv -fp-stack-check -fpe0 -fpconstant -vec_report0 -diag-disable 8291 -warn unused -o BIN.exe -O3 file1.f90 file2.f90 ....
You need to convert the ifort flags to gfortran flags. To the best of my knowledge this can only be done by reading the documentation of ifort and gfortran. I'm no expert but:
maybe -fpconstant can be replaced by -fdefault-real-8 (if I understand correctly this gfortran flag has the effect of the ifort flags -r8 and -fpconstant),
maybe -fpe0 can be replaced by using -ffpe-trap=XXX.
PS: You can find some equivalences at the page Compiling with gfortran instead of ifort.
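As a rough, hedged sketch of what the third ifort line might become with gfortran (the -fcheck and -ffpe-trap selections below are my best guesses at equivalents, not an exact one-to-one translation, and -fdefault-real-8 follows the -fpconstant guess above):
% gfortran -g -fbacktrace -Wall -Wextra -fcheck=bounds,pointer -ffpe-trap=invalid,zero,overflow -finit-real=snan -fdefault-real-8 -o BIN.exe -O3 file1.f90 file2.f90 ....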

So what "is" AM_(F)CFLAGS?

The autotools documentation is very confusing. I am writing Fortran, so the AM_CFLAGS equivalent is AM_FCFLAGS. The two (presumably) work exactly the same way.
First off, what actually "is" AM_CFLAGS, conceptually? Clearly, the "CFLAGS" bit is to do with setting compiler flags. But what does the "AM_" part mean?
There seems to be conflicting advice as to how to use it. Some say don't put it in Makefile.am, and some say don't put it in configure.ac. Who is right?
Here is my current Makefile.am:
AM_FCFLAGS = -Wall -O0 -C -fbacktrace
.f90.o:
	$(FC) -c $(AM_FCFLAGS) $<
What I want to happen is to compile with "-Wall -O0 -C -fbacktrace" by default if I'm compiling with gfortran. However, a user might want to use a different compiler, e.g. FC=ifort, in which case they'll probably have to pass in FCFLAGS="whatever" and completely scrap AM_FCFLAGS.
Can the user also override the default AM_FCFLAGS from the configure option if they're still using gfortran?
Basically, WTF?
AM_FCFLAGS (and likewise AM_CFLAGS and friends) is designed not to be user-overridable, so you should not put options there unless you want them to always be present.
Users can pass their own FCFLAGS as part of their ./configure call. If you want to default to your own flags rather than to what autoconf picks by itself, you can change configure.ac to compare the current FCFLAGS against the default flags (which, to be honest, I don't know for Fortran), and, if they match, replace FCFLAGS with your defaults.
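A minimal configure.ac sketch of that comparison, placed after AC_PROG_FC and assuming the autoconf default FCFLAGS for gfortran is "-g -O2" (which is what it usually picks for GNU compilers, but check config.log on your system):
AS_IF([test "x$FCFLAGS" = "x-g -O2"],
      [FCFLAGS="-Wall -O0 -C -fbacktrace"])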
In Makefile.am I had
AM_FCFLAGS = -Wall -O0 -C -fbacktrace
Bad idea! It assumes that folks are using gfortran and/or won't want to override those defaults. So I deleted that line.
Instead, I now have the following lines in configure.ac:
AC_PROG_FC([gfortran], [Fortran 90]) # we need a Fortran 90 compiler
dnl Set some flags automatically if using gfortran
AS_IF([test x$FC = xgfortran -a x$ac_cv_env_FCFLAGS_set = x],
      [AC_SUBST([FCFLAGS], ["-Wall -O0 -C -fbacktrace"])])
AC_PROG_FC checks for a compiler that handles Fortran 90 (trying gfortran first) and automatically sets FC and FCFLAGS.
The last 3 lines set sensible defaults if the user is using gfortran but hasn't set FCFLAGS.
I discovered ac_cv_env_FCFLAGS_set when I looked at config.log. It is set to "set" if the user sets their own FCFLAGS.
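A user who wants something else can still override everything at configure time, for example:
./configure FC=ifort FCFLAGS="-O2 -g"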
In Makefile.am I now have rules like:
.f90.o:
	$(FC) -c $(FCFLAGS) $<
datetime_module.mod : datetime.o
datetime.o : datetime.f90 mod_clock.o mod_datetime.o mod_strftime.o mod_timedelta.o
mod_clock.o: mod_clock.f90 mod_datetime.o mod_timedelta.o
mod_datetime.o: mod_datetime.f90 mod_constants.o mod_strftime.o mod_timedelta.o
mod_timedelta.o: mod_timedelta.f90
It's starting to make sense now.

How can a segfault happen at runtime only because of linking unused modules?

I get a segmentation fault from a memory allocation statement just because I have linked some unrelated procedures to the binary.
I have a very simple Fortran program:
program whatsoever
!USE payload_modules
double precision,allocatable:: Vmat(:,:,:)
allocate(Vmat(2,2,2))
Vmat=1
write(*,*) Vmat
deallocate (Vmat)
! some more lines of code using procedures from payload_module
end program whatsoever
Compiling this using gfortran whatsoever.f95 -o whatsoever leads to a program with the expected behaviour. Of course, this program is not meant to print 1.000 eight times but to call the payload modules, which are still hidden in the comments. However, if I compile and link the program with the modules, issuing
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload_module1.f90 payload_module2.f90 whatsoever.f95
gcc -g -nostdlib -v -Wl,--verbose -std=gnu99 -shared -Wl,-Bsymbolic-functions \
-Wl,-z,relro -o whatsoever whatsoever.o payload_module1.o payload_module2.o
the program whatsoever doesn't run any more. I get a segmentation fault at the allocate statement. I have not yet uncommented the lines related to the modules (however, uncommenting them leads to the same behaviour)!
I know that the payload modules' code is not buggy, because I ran it before from R and wrapped the working code into an f90 module. There are no name collisions; nothing in the modules is called Vmat. There is only one other call to allocate in the modules, and it never caused any trouble. There is still plenty of memory left. gdb didn't give me any hints except a memory address.
How can linking routines that are actually not called crash a program?
Compiling your code with
gfortran whatsoever.f95 -o whatsoever
works because you link against the system libraries; everything is in place. This would correspond to
gfortran whatsoever.f95 payload_module1.f90 payload_module2.f90 -o whatsoever
which would also work. The commands you used instead omit the system libraries, and the code fails the first time you call a function from them (at the allocation). You don't see that you are missing the libraries, because you create a shared object (which is typically linked against the libraries later on).
You chose to separate compiling the objects from linking them into an executable. Doing this for a Fortran program using gcc, you need to specify the Fortran runtime library yourself, so there's a -lgfortran missing.
I'm not sure about that particular choice of compile options... -shared is usually used for libraries; are you sure you want a shared binary (whatever that is)?
With -nostdlib you tell the compiler not to link against the system libraries. You would then need to specify those libraries (which you don't).
For the main program test.F90 and a module payload.F90, I run
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload.F90 test.F90
gcc -g -v -Wl,--verbose -std=gnu99 -Wl,-Bsymbolic-functions \
-Wl,-z,relro -lgfortran -o whatsoever test.o payload.o
This compiles and executes correctly.
It might be easier to let gfortran drive the link and pass the linker options through it:
gfortran -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none -Wl,-Bsymbolic-functions -Wl,-z,relro \
payload.F90 test.F90 -o whatsoever
The result is the same.
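If in doubt, one can check (on Linux, assuming a dynamically linked executable) that the Fortran runtime actually ended up in the binary:
ldd ./whatsoever | grep libgfortran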

Error: bad value for -march= switch

I wrote a Makefile and I can't get it to work. I have an option which is supposed to select which processor to compile for. However, when I run make from the command line it says:
tandex@tandex-P-6860FX:~/emulators/nintendo sdks/3DS SDK [HomeBrew]$ make
gcc -march=arm7tdmi -static -fexceptions -fnon-call-exceptions -fstack-check test.c -c
test.c:1:0: error: bad value (arm7tdmi) for -march= switch
make: *** [ALL] Error 1
But in the man pages for gcc, it states that arm7tdmi is a permissible value. Am I missing something?
Makefile:
#3DS Compilation Makefile (c) TanDex (TEQ)RunawayFreelancers
#
#Version 0.99 (Alpha) For *nix Devices
#
#Please Check Back Soon for 3rd SDK
#SELECT THE COMPILER TO USE! GCC RECOMMENDED!
#FOR SANITY SAKE, USE C FILES WITH GCC AND CPP FILES WITH G++
CC=gcc
#CC=g++
#OBJECTCOPY REFERENCE, DO NOT REMOVE
OBJC=objcopy
OBJREFS= -O Binary
#SELECT THE PROCESSOR TO TUNE IT TO. ARMV7 (Nintendo DS) or ARMV9 (Nintendo DS (Graphical Support))
#or ARM11 Core ARM1176JZ-S and ARM1176JZF-S (3DS Processor? Not Sure if Correct. Try and see if they Work?)
#
#NOTE: DS GAMES REQUIRE BOTH A ARM7 AND ARM9 BINARY. RUN THIS TWICE (ONCE FOR EACH)
#
#UNCOMMENT FOR PROCESOR
PROCESSOR=arm7tdmi
#PROCESSOR=arm946e-s
#PROCESSOR=arm1176jz-s
#PROCESSOR=arm1176jzf-s
#FILES
#
#PLACE ALL OF THE FILES HERE, THAT ARE BEING COMPILED!
FILES=test.c
#SET BIN FILE NAME BASED ON PROCESSOR SELECTED
ifeq($(PROCESSOR),arm7tdmi)\
NAME=ARM7.BIN
ifeq($(PROCESSOR), arm946e-s)\
NAME=ARM9.BIN
ifeq($(PROCESSOR), arm1176jz-s)\
NAME=ARM11.BIN
ifeq($(PROCESSOR), arm1176jzf-s)\
NAME=ARM11.BIN
#CREATE OBJECTS
ifeq($(CC), gcc)\
OBJECTS=$(FILES:.c=.o)
ifeq($(CC), g++)\
OBJECTS=$(FILES:.cpp=.o)
#FLAGS! DO NOT CHANGE THESE!!!!!!!!!!! THAT MEANS YOU!!!!!
#
#FOR THOSE WHO WANT TO KNOW WHAT THESE DO, HERE THEY ARE:
#-mtune=$(PROCESSOR) FORCE THE COMPILER TO TUNE OUTPUT TO THE SPECIFIED PROCESSOR
#-static REQUIRED FOR CLEAN BINARY OUTPUT?? (NOT SURE WHAT THIS DOES, BUT WAS SUGGESTED IN A POST ON STACKOVERFLOW)
#-fexceptions FORCE EXCEPTIONS
#-fnon-call-exceptions FORCE EXCEPTIONS TO ONLY BE RETURNED BY THE SYSTEM (MEMORY AND FPU INSTRUCTIONS FOR EXAMPLE)
#-fstack-check FORCE STACK CHECKING (DS / 3DS USE AWKWARD STACK IMPLEMENTATION)
CFLAGS=-march=$(PROCESSOR) -static -fexceptions -fnon-call-exceptions -fstack-check
ALL:
	$(CC) $(CFLAGS) $(FILES) -c
.c.o:
	$(OBJC) $(OBJREFS) $(OBJECTS) $(NAME)
.cpp.o:
	$(OBJC) $(OBJREFS) $(OBJECTS) $(NAME)
You are probably not calling the right gcc. You seem to be calling the gcc installed in your system, rather than the one that comes with the 3DS SDK.
It appears the problem is with -march=arm7tdmi.
I think the workaround is to avoid -march=arm7tdmi and instead use -march=name, where name is one of the architecture names listed in section 3.17.4, ARM Options, of the GCC manual (arm7tdmi is a CPU name, which belongs with -mcpu=, not an architecture name).
Here's part of the page:
-march=name
This specifies the name of the target ARM architecture. GCC uses this name to determine what kind of instructions it can emit when generating assembly code. This option can be used in conjunction with or instead of the -mcpu= option. Permissible names are: ‘armv2’, ‘armv2a’, ‘armv3’, ‘armv3m’, ‘armv4’, ‘armv4t’, ‘armv5’, ‘armv5t’, ‘armv5e’, ‘armv5te’, ‘armv6’, ‘armv6j’, ‘armv6t2’, ‘armv6z’, ‘armv6kz’, ‘armv6-m’, ‘armv7’, ‘armv7-a’, ‘armv7-r’, ‘armv7-m’, ‘armv7e-m’, ‘armv7ve’, ‘armv8-a’, ‘armv8-a+crc’, ‘iwmmxt’, ‘iwmmxt2’, ‘ep9312’.
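Assuming a toolchain that actually targets ARM (for example an arm-none-eabi-gcc cross-compiler; the host gcc shown in the transcript will reject ARM values either way), arm7tdmi is accepted by -mcpu=, and the ARM7TDMI implements the ARMv4T architecture, so either of these lines in the Makefile should get past the error:
CFLAGS=-mcpu=$(PROCESSOR) -static -fexceptions -fnon-call-exceptions -fstack-check
#or, selecting by architecture instead of CPU:
#CFLAGS=-march=armv4t -static -fexceptions -fnon-call-exceptions -fstack-check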

optimization and debugging options in Makefile

I wonder where to put the optimization and debugging options in a Makefile: at the linking stage or the compiling stage? I am reading a Makefile:
ifeq ($(STATIC),yes)
LDFLAGS=-static -lm -ljpeg -lpng -lz
else
LDFLAGS=-lm -ljpeg -lpng
endif
ifeq ($(DEBUG),yes)
OPTIMIZE_FLAG = -ggdb3 -DDEBUG
else
OPTIMIZE_FLAG = -ggdb3 -O3
endif
ifeq ($(PROFILE),yes)
PROFILE_FLAG = -pg
endif
CXXFLAGS = -Wall $(OPTIMIZE_FLAG) $(PROFILE_FLAG) $(CXXGLPK)
test: test.o rgb_image.o
	$(CXX) $(CXXFLAGS) -o $@ $^ $(LDFLAGS)
Makefile.depend: *.h *.cc Makefile
	$(CC) -M *.cc > Makefile.depend
clean:
	\rm -f absurdity *.o Makefile.depend TAGS
-include Makefile.depend
What surprises me is that CXXFLAGS is used in linking. I know it is also used in the implicit rule for compiling to generate .o files, but is it necessary to use it again for linking? Specifically, where should I put the optimization and debugging options: the linking stage or the compiling stage?
Short answer:
optimization: needed at compile time
debug flag: needed at compile time
debugging symbols: needed at both compile and link time
Take note that the linker decides what bits of each object file and library need to be included in the final executable. It could throw out the debugging symbols (I don't know what the default behavior is), so you need to tell it not to.
Further, the linker will silently ignore options which do not apply to it.
In response to the comments: the above are very general claims based on knowing what happens at each stage of compilation, hence no reference.
A few more details:
optimization: takes two major forms. Peephole optimization can occur very late, because it works on a few assembly instructions at a time (I presume that in the GNU tool chain the assembler is responsible for this step), but the big gains are in structural optimizations, which are generally accomplished by rewriting the Abstract Syntax Tree (AST), and that is only possible during compilation.
debug flag: In your example this is a preprocessor directive, and only affects the first part of the compilation process.
debugging symbols: Look up the ELF file format (for instance); you'll see that various bits of code and data are organized into different sections. Debugging symbols are stored in the same file as the code they relate to, but are necessarily kept separate from the actual code. As such, any program that manipulates these files could just dump them. Therefore both the compiler and the linker need to know whether you want them or not.
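As a rough Makefile sketch of that split (reusing the target names from the question; the exact flag values are only illustrative):
# compile stage: optimization, -DDEBUG and the -g family act here, per object file
CXXFLAGS = -Wall -ggdb3 -O3
# link stage: the libraries are only needed here
LDLIBS = -lm -ljpeg -lpng

test.o: test.cc
	$(CXX) $(CXXFLAGS) -c test.cc
rgb_image.o: rgb_image.cc
	$(CXX) $(CXXFLAGS) -c rgb_image.cc

# passing CXXFLAGS again at link time is harmless: options that do not apply
# to linking are ignored, while the -g options make sure the symbols survive
test: test.o rgb_image.o
	$(CXX) $(CXXFLAGS) -o test test.o rgb_image.o $(LDLIBS)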
