ALSA external plugin and OpenMP in C++

I'm creating an ALSA external module using the gtkIOStream ALSAExternalPlugin class.
In my external plugin code, I am making the necessary OpenMP calls:
omp_set_num_threads(omp_get_max_threads());
printf("omp_get_num_threads()=%d\n", omp_get_num_threads());
I am also compiling with the necessary OpenMP flags and libraries (-fopenmp and -lgomp).
However, when I run my code using "aplay -DexternalPlugin file", the system reports only one thread in use instead of 20.
Am I missing something?
The linking flags for the external plugin are as follows:
-fopenmp -lgomp -module -avoid-version -export-dynamic -no-undefined
-fopenmp is also in the CPP flags, and I can see it being passed at compile time.

Setting the number of threads does not make your code go parallel. As written, you are setting the number of threads that will be used by the next parallel region, and then printing the number of threads currently in use, which will indeed be one, since you haven't gone parallel.
In general, there is no point in forcing the number of threads, since any sane OpenMP runtime (certainly GCC and LLVM) will use all of the available threads by default.
Just print omp_get_max_threads() to see what will be used.
Of course, looking at machine load externally when running your code is also a way to check this!
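To see the distinction, here is a minimal self-contained sketch, assuming a plain C++ file compiled with something like g++ -fopenmp demo.cpp (file name hypothetical). omp_get_num_threads() returns 1 outside a parallel region and only reports the team size from inside one:

#include <cstdio>
#include <omp.h>

int main() {
    // Outside any parallel region only the initial thread is running,
    // so this always prints 1.
    printf("outside: omp_get_num_threads()=%d\n", omp_get_num_threads());

    // Inside a parallel region the whole team is active.
    #pragma omp parallel
    {
        // Let a single thread report the team size.
        #pragma omp single
        printf("inside: omp_get_num_threads()=%d\n", omp_get_num_threads());
    }
    return 0;
}

On a 20-thread machine the second line should report 20 (or whatever the runtime chose), with no call to omp_set_num_threads() needed.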

Related

How to use GCC LTO with differently optimized object files?

I'm compiling an executable with arm-none-eabi-gcc for a Cortex-M4 based microcontroller. Non-performance-critical code is compiled with -Os (optimized for executable code size) and performance-critical parts with other optimization flags, e.g. -Og / -O2 etc.
Is it safe to use -flto in such a build? If so, which optimization flag should be passed to the linker?
According to the GCC documentation regarding optimise options:
It is recommended that you compile all the files participating in the same link with the same options
Such a statement is rather vague. Nevertheless, when digging into the release notes of GCC 5, there are some additional details:
Command-line optimization and target options are now streamed on a per-function basis and honored by the link-time optimizer. This change makes link-time optimization a more transparent replacement of per-file optimizations. It is now possible to build projects that require different optimization settings for different translation units (such as -ffast-math, -mavx, or -finline).
And also information about which flags are affected by such limitations and which aren't:
Note that this applies only to those command-line options that can be passed to optimize and target attributes. Command-line options affecting global code generation (such as -fpic), warnings (such as -Wodr), optimizations affecting the way static variables are optimized (such as -fcommon), debug output (such as -g), and --param parameters can be applied only to the whole link-time optimization unit. In these cases, it is recommended to consistently use the same options at both compile time and link time.
In your scenario, the optimisation flags -Og, -O2 and -Os can be passed as optimise attributes and do not fall into the cases where the compile time and link time flags ought to be the same. So yes, it should be safe to use -flto in such a build.
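These are the same attributes you can apply per function in source code; a minimal sketch (function names are hypothetical):

// Performance-critical routine, optimised for speed.
__attribute__((optimize("O2")))
void hot_path(void) { /* ... */ }

// Rarely-executed routine, optimised for size.
__attribute__((optimize("Os")))
void cold_path(void) { /* ... */ }

The per-function command-line options streamed by -flto are honoured through the same mechanism.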
Regarding the optimisation flags passed at link time, as stated in the release notes:
Contrary to earlier GCC releases, the optimization and target options passed on the link command line are ignored.
GCC automatically determines which optimisation level to use, which is the highest level used when compiling the object files. You therefore don't need to pass any of your -O optimisation options to the linker.
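A sketch of such a mixed build (file names are hypothetical), with -flto on every compile step and the final link done through the compiler driver:

arm-none-eabi-gcc -c -Os -flto support.c -o support.o
arm-none-eabi-gcc -c -O2 -flto hot_path.c -o hot_path.o
arm-none-eabi-gcc -flto support.o hot_path.o -o firmware.elf

Each object file carries its own streamed optimisation options, and no -O flag is needed on the link line.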

C++ Compilation flags for an R package in Windows/Mac

I developed an R package which calls C++ code through Rcpp and RcppEigen. My Makevars.win looks like this (the enumeration refers to my questions below):
1. CXX_STD = CXX11
2. PKG_CPPFLAGS = -fopenmp -O3 -Wall -ftree-vectorize -march=native -mavx -mfma
3. PKG_CXXFLAGS += $(SHLIB_OPENMP_CXXFLAGS)
4. PKG_LIBS = -fopenmp
5. PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS)
6. PKG_CPPFLAGS += -I../inst/include/
as I want to use OpenMP and link the R package against the Intel MKL library. I am also adding the plugins // [[Rcpp::plugins(cpp11)]] and // [[Rcpp::plugins(openmp)]] in my source files.
When I compile the package everything works fine, but I am still getting the default compilation flags -O2 and -std=c++0x. So my questions are:
A. Isn't 1. supposed to force -std=c++11? (By the way, using the same Makevars yields the right C++ version, so there must be something specific to Windows.)
B. Doesn't 3. repeat the -fopenmp already in 2.?
C. How can I check whether 5. has been taken into account? I am asking because the same package built on Mac is much faster than on Windows, even though the configurations are the same. I have benchmarked the same code on Windows using Microsoft R Open and on Mac, and Windows was faster in that case.
Thank you very much for your help.
Where to start?
First off, compilation and linking options are based on the union of R's Makeconf and your package's src/Makevars. You can add to the values; you cannot replace them.
Second, and related, which BLAS you get is a system setup issue. You cannot generally govern that from your package.
Third, plugins are for sourceCpp() and cppFunction(). In packages you make direct declarations instead, i.e. CXX_STD = CXX11.
Fourth, there are almost 1000 packages on CRAN using Rcpp. Sometimes it helps just to look at what some of these do. Many employ OpenMP.
Fifth, OpenMP is severely challenging on OS X thanks to Apple. I've forgotten what the Windows situation is. It just works on Linux.
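Putting the first and third points together, a sketch of a leaner Makevars.win, relying on R's own $(SHLIB_OPENMP_CXXFLAGS) rather than a hard-coded -fopenmp, might look like this:

CXX_STD = CXX11
PKG_CPPFLAGS = -I../inst/include/
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

The -O2 and -std flags you see come from R's Makeconf; per the first point above, your Makevars adds to those options rather than replacing them.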

How does gcc's link-time optimisation (-flto flag) work?

I understand the idea more or less: when compiling separate modules and producing assembly code, functions calling each other have to strictly respect the calling convention, which kills the opportunity for many optimisations.
For instance, if I have function A which calls function B, which calls function C, all three in their own separate source files, it becomes possible to allocate registers evenly across the functions so that no register saving on the stack is necessary at all during those calls. With traditional compile-assemble-link builds this is not possible, as the caller-saved and callee-saved registers are imposed by the calling convention.
Another optimisation is to inline functions which are called only once. This was previously possible only if the function was local to the translation unit, but thanks to link-time optimisation it's now possible even if the function is in another source file.
Now, if I compile with both -flto and -S flags, I see that instead of normal assembly instructions, gcc generates an encoded representation of the program, such as this:
.section .gnu.lto_.inline.c3c5e6ef8ec983c,"dr0"
.ascii "x\234mQ;N\303#\20}\273\353\17\370C\234\20\242`\"!Q\20\11Ah\322&\25\242\314\231|\4\32\220\220(,$.#\205D\343\3P Z.\341Tn\231\35\274\31L\342\342\355\314\274\371<\317\30\354\376\356\365\357\333\7\262"
.ascii "1\240G\325\273\202\7\216\232\204\36\205"
.ascii "8\242\370\240|\222"
.ascii "8\374\21\205ty\352\"*r\340!:!n\357n%]\224\345\10|\304\23\342\274z\346"
.ascii "8\35\23\370\7\4\1\366s\362\203j\271]\27bb{\316\353\27\343\310\4\371\374\237*n#\220\342rA\31"
.ascii "7\365\263\327\231\26\364\10"
.ascii "2\\-\311\277\255^w\220}|\340\233\306\352\263\362Qo+e+\314\354\277\246\354\252\277\20\364\224%T\233'eR\301{\32\340\372\313\362\263\242\331\314\340\24\6\21s\210\243!\371\347\325\333&m\210\305\203\355\277*\326\236\34\300-\213\327\306\2Td\317\27\231\26tl,\301\26\21cd\27\335#\262L\223"
.ascii "8\353\30\351\264{I\26\316\11\14"
.ascii "9\326h\254\220B}6a\247\13\353\27M\274\231"
.ascii "0\23M\332\272\272%d[\274\36Q\200\37\321\1&\35"
Since the data is in its own particular section, the linker sees this and does the code generation. If the module had been written in assembly, or compiled without the -flto flag, the linker would see ordinary code in the .text section instead, so no confusion is possible for the linker.
The problem is: how can the linker generate code? Normally only gcc can generate code; the linker's role here is just to change a few offsets and adapt the binary format. In order to generate code, the linker would need to contain a second copy of the entire gcc backend (the half of the compiler which generates assembly code from the intermediate representation), as well as the entire assembler (since no assembly code was produced). How is such a thing possible, especially considering that binutils is a completely separate entity from gcc, developed by different teams?
GCC's -flto emits a serialized form of GCC's internal representation, as you discovered.
Then, at link time, the linker reinvokes GCC and passes it the objects that need final compilation. GCC reads the internal representation and does the work.
I think the actual work is done in collect2, which is part of GCC that is used when invoking the linker (I'm a little fuzzy on the details). There is also a "linker plugin" system that enables this to work a little better (like letting the linker decide how to split the compilation). This is implemented at least by the binutils ld and by gold; but as far as I recall this is just an optimization and isn't needed to get the basic -flto feature to work. You can see a bit more information on the original LTO project page; and maybe links from there would explain more.
There is more overlap between the GCC and binutils teams than you might think. The two projects share some code and have a long history of working together. Some people work on both projects.
From https://gcc.gnu.org/wiki/LinkTimeOptimization:
Despite the "link time" name, LTO does not need to use any special
linker features. The basic mechanism needed is the detection of GIMPLE
sections inside object files. This is currently implemented in
collect2 [which is called by gcc; -ps]. Therefore, LTO will work on any linker already supported by
GCC.
I assume this means you must link by calling the compiler driver gcc. Simply linking with the system's vanilla linker wouldn't optimize the whole program, as you already concluded.
Update:
https://gcc.gnu.org/onlinedocs/gccint/Collect2.html says
The program collect2 is installed as ld in the directory where the passes of the compiler are installed. When collect2 needs to find the real ld it tries the following file names: [...]
(The page goes on detailing how collect2 looks for configuration-dependent executables and ones with well-known names like real-ld, finally even ld; but will not call itself recursively.)
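You can watch this mechanism in action; a quick sketch (foo.c stands for any hypothetical source file):

gcc -c -flto foo.c -o foo.o
objdump -h foo.o          # the section list includes .gnu.lto_* sections carrying the GIMPLE stream
gcc -flto foo.o -o foo    # linking via the gcc driver triggers the final code generation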

IAR EWARM linking to a gcc EABI-built library

I have been able to build code in IAR EWARM (7.40) (for the ST STM32F407IG ARM Cortex-M4) which links to a library built under Ubuntu via gcc (4.9.3). This mostly works, but some build environment adjustments on the IAR side, the gcc side, or both still remain. I would appreciate whatever help you can point me to.
There are no build errors evident, but EWARM and arm-none-eabi-gcc disagree on the locations of parameters being passed to the gcc-built library. The EWARM debugger and the code generated by EWARM agree with each other, but it appears, from investigations so far, that the locations expected by the gcc-generated code are offset from those expected by EWARM by eight bytes. I've only investigated a single call, so this may not be constant...
IAR's compiler flags include: --aeabi and --guard_calls as per section: "AEABI compliance" in the EWARM help section.
arm-none-eabi-gcc compiler flags include: -gdwarf-3 -mabi=aapcs -march=armv7e-m -mthumb.
I believe this tells both EWARM and gcc to play nice together with ARM AAPCS standard procedure calls and DWARF v3 formats.
EWARM does seem to be happy with either -gdwarf-2 or -gdwarf-3 (but not -gdwarf-4). This selection does not appear to affect the issue discussed above.
What else is required?
The answer to "What else is required?" appears to be nothing. Just be darn sure that all of the macros evaluated by #ifdef statements match across the environments, so you don't end up with different-sized data structures in the two different environments! #ifdef code in header files should be carefully evaluated...
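As a hypothetical illustration of the hazard (the macro and struct names are made up), a single #ifdef in a shared header is enough to shift every field that follows it:

/* shared_header.h -- hypothetical example */
#include <stdint.h>

typedef struct {
    uint32_t id;
#ifdef ENABLE_STATS    /* defined in only one of the two build environments */
    uint32_t stats[2]; /* adds 8 bytes and shifts every field below it */
#endif
    uint32_t flags;
} Message;

If one toolchain defines ENABLE_STATS and the other does not, sizeof(Message) and the offset of flags differ by exactly eight bytes, the same kind of mismatch described above.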

What does '-Olimit 2000' mean for cc

I am trying to compile an old program (originally compiled with cc) using gcc. In the makefile there is a line like this:
CFLAGS = -O2 -Olimit 2000 -w
There is no '-Olimit 2000' in gcc. I am wondering what it really means, and whether it is safe to just delete this option when using gcc.
As far as I can tell, this was only supported by IRIX's C compiler. I can't even find a solid reference as to what it was used for. Since it doesn't do anything with GCC, it's definitely safe to remove it.
In a little more detail: it was used to disable optimization on routines that were larger than the "Olimit". The limit exists so that the amount of time spent on optimization is bounded. Specifying an Olimit of 0 means an "infinite Olimit", which optimizes every routine. Here's a man page for MIPSpro: http://cimss.ssec.wisc.edu/~gumley/modis/old/mips_64.pdf
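So with GCC the makefile line can simply become:

CFLAGS = -O2 -w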
