C++ Compilation flags for an R package in Windows/Mac - c++11

I developed an R package which calls C++ code through Rcpp and RcppEigen. My Makevars.win looks like this (the enumeration is meant to refer to my questions):
1. CXX_STD = CXX11
2. PKG_CPPFLAGS = -fopenmp -O3 -Wall -ftree-vectorize -march=native -mavx -mfma
3. PKG_CXXFLAGS += $(SHLIB_OPENMP_CXXFLAGS)
4. PKG_LIBS = -fopenmp
5. PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS)
6. PKG_CPPFLAGS += -I../inst/include/
as I want to use OpenMP and link the R package against Intel MKL library. I am also adding in my source files the plugins // [[Rcpp::plugins(cpp11)]] and // [[Rcpp::plugins(openmp)]].
When I compile the package everything works fine but I am still getting the default compilation flags -O2 and -std=c++0x. So my questions are:
A. Isn't 1. supposed to force -std=c++11? (By the way, using the same Makevars on Mac yields the right C++ version, so there must be something specific to Windows.)
B. Does 3. repeat the -fopenmp in 2.?
C. How can I check whether 5. has been taken into account? I am asking because the same package built on Mac is much faster than on Windows, although their configurations are the same. I have benchmarked the same code on Windows using Microsoft R Open and on Mac, and Windows was faster in that case.
Thank you very much for your very precious help.

Where to start?
First off, compilation and linking options are based on the union of R's Makeconf and your package's src/Makevars. You can add to the values; you cannot replace them.
Second, and related, which BLAS you get is a system setup issue. You cannot generally govern that from your package.
Third, plugins are for sourceCpp() and cppFunction(). In packages you make direct declarations, i.e. CXX_STD = CXX11.
Fourth, there are almost 1000 packages on CRAN using Rcpp. Sometimes it helps just to look at what some of these do. Many employ OpenMP.
Fifth, OpenMP is severely challenging on OS X thanks to Apple. I've forgotten what the Windows situation is. It just works on Linux.
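One way to check what the compiler actually received, rather than reading the build log, is to report a few predefined macros from inside the compiled code. The sketch below is an illustration under assumptions, not part of the original package: the function names are hypothetical, `__cplusplus` and `_OPENMP` are standard predefined macros, and `__OPTIMIZE__` is a GCC/Clang macro that only says whether some optimization level was enabled, not which one. In a package these functions would still need to be exported to R (for example through Rcpp), which is omitted here.

```cpp
#include <sstream>
#include <string>

// Value of __cplusplus seen by this translation unit
// (201103L for a conforming C++11 build; older compilers may report other values).
long cpp_standard_value() {
    return static_cast<long>(__cplusplus);
}

// True if the compiler was invoked with OpenMP enabled (e.g. -fopenmp).
bool built_with_openmp() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// True if __OPTIMIZE__ was defined, i.e. GCC/Clang ran with some -O level above -O0.
// Other compilers may not define this macro at all.
bool built_with_optimization() {
#ifdef __OPTIMIZE__
    return true;
#else
    return false;
#endif
}

// One-line, human-readable summary of the checks above.
std::string build_summary() {
    std::ostringstream out;
    out << "__cplusplus=" << cpp_standard_value()
        << (built_with_openmp() ? ", OpenMP on" : ", OpenMP off")
        << (built_with_optimization() ? ", optimized" : ", not optimized or unknown");
    return out.str();
}
```

Calling `build_summary()` from the built package on Windows and on Mac would show whether the two builds really saw the same standard and OpenMP settings.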

Related

ALSA external plugin and openmp in C++

I'm creating an ALSA external module using the gtkIOStream ALSAExternalPlugin class.
In my external plugin code, I am calling the necessary openmp calls :
omp_set_num_threads(omp_get_max_threads());
printf("omp_get_num_threads()=%d\n", omp_get_num_threads());
I am also compiling with the necessary openmp flags and libraries (-fopenmp and -lgomp).
However, when I run my code using "aplay -DexternalPlugin file", the system reports only one thread in use instead of 20 threads.
Am I missing something?
The linking flags for compiling the external plugin are like so :
-fopenmp -lgomp -module -avoid-version -export-dynamic -no-undefined
-fopenmp is also in the CPP flags and I can see them at compile time.
Setting the number of threads does not make your code go parallel. As written, you set the number of threads that will be used by the next parallel region, and then print the number of threads currently in use, which is indeed one, since you are not inside a parallel region.
In general, there is no point in forcing the number of threads, since any sane OpenMP runtime (certainly GCC and LLVM) will use all of the available threads by default.
Just print omp_get_max_threads() to see what will be used.
Of course, looking at machine load externally when running your code is also a way to check this!
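The difference between the two queries can be seen in a small sketch. It assumes a compiler invoked with -fopenmp; the function names are hypothetical, and the `#ifdef _OPENMP` guards are only there so the code still compiles (and runs serially) when OpenMP is disabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Team size as seen OUTSIDE any parallel region: always 1.
int threads_outside_parallel() {
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}

// Team size as seen INSIDE a parallel region: the number actually in use.
int threads_inside_parallel() {
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread writes the shared result.
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}

// Upper bound on the team size the next parallel region may use.
int max_threads_available() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Printing `threads_outside_parallel()` reproduces the "one thread" report from the question, while `threads_inside_parallel()` and `max_threads_available()` show what a parallel region gets.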

AVX512 and MSVC preprocessor symbol

According to this link there are no predefined preprocessor symbols for AVX512 (MSVC 2017).
I'm trying to build thundersvm, which uses the Eigen library, on (you guessed it) Windows. Both Eigen and thundersvm use cmake and, depending on the compiler's preprocessor symbols, Eigen compiles with avx512 instructions or not.
It seems that using /arch:AVX512 doesn't trigger any errors in MSVC but doesn't define __AVX512F__ symbol which Eigen needs. I also tried to include -D__AVX512F__=ON in the cmake arguments but still no luck.
Since there is no predefined preprocessor symbol for AVX512, is there any way to force Eigen to compile with avx512?
Update
Following chtz's comment, I've checked out the default branch of Eigen and recompiled thundersvm with /arch:AVX512 and these cmake arguments (maybe not all are needed):
-DUSE_CUDA=OFF -DUSE_EIGEN=ON -DBUILD_SHARED_LIBS=OFF -DEIGEN_ENABLE_AVX512=ON -D__AVX512F__=ON -DEIGEN_VECTORIZE_AVX512=ON -DEIGEN_VECTORIZE_AVX2=ON -DEIGEN_VECTORIZE_AVX=ON -DEIGEN_VECTORIZE_FMA=ON
Comparing the instruction mix from Intel's SDE -mix tool before and after the patch, I can clearly see that AVX instructions are used (SDE complains that it doesn't recognise the instruction vbroadcastss zmm0, xmm0 when running for an skl cpu, but works fine for skx). The problem is that MSVC uses the scalar version of AVX and there is no improvement in the runtime (the total number of instructions is also the same), which is similar to this post.
Are there other flags I need to define so that MSVC generates non-scalar instructions? (I think I'll also give gcc a try.)
MSVC has poor support for AVX-512 and no distinction between the different subsets. There is no safe way to produce AVX512F code on MSVC without also possibly making AVX512DQ instructions.
The best compilers for AVX-512 are gcc and clang. There is a Clang plugin to Visual Studio that you can use if you like the IDE. The gcc and clang compilers have preprocessor symbols like __AVX512F__, __AVX512VL__, etc.
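To see which of these symbols a given compiler and flag combination actually defines, a small probe can be compiled with the same options as the library. This is only a sketch: the function and constant names are hypothetical, and the list of macros is a selection of those used by gcc and clang, not an exhaustive one.

```cpp
#include <string>

// Compile-time flag: was __AVX512F__ defined for this translation unit?
#ifdef __AVX512F__
constexpr bool kHasAvx512f = true;
#else
constexpr bool kHasAvx512f = false;
#endif

// Returns a space-separated list of the vector-extension macros that were
// defined when this translation unit was compiled, or "(none)".
std::string simd_macros_seen() {
    std::string out;
#ifdef __AVX__
    out += "__AVX__ ";
#endif
#ifdef __AVX2__
    out += "__AVX2__ ";
#endif
#ifdef __FMA__
    out += "__FMA__ ";
#endif
#ifdef __AVX512F__
    out += "__AVX512F__ ";
#endif
#ifdef __AVX512VL__
    out += "__AVX512VL__ ";
#endif
#ifdef __AVX512DQ__
    out += "__AVX512DQ__ ";
#endif
    if (out.empty()) {
        out = "(none)";
    }
    return out;
}
```

If `__AVX512F__` is missing from the output, a library that keys its vectorization on that symbol will not take its AVX-512 code path, whatever the /arch or -m flags say.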

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) by popular compilers I mean any of: gcc, clang, msvc, icc, nvcc (for GPU kernel code) about which you have that information.
In GCC you can declare functions as follows:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
    return val / 2;
}
This is a GCC-only feature.
See working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC does not verify optimize() arguments, so typos like "-ffast-match" will be silently ignored.
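Applied to the template from the question, the attribute approach might look like the sketch below. It is GCC-specific and rests on several assumptions: the macro `FAST_MATH_FN` and the helper names are hypothetical, the function returns a sum instead of `void` so the result can be checked, and on compilers other than GCC the macro expands to nothing, so both variants are compiled identically. Whether GCC honors the attribute in every inlining situation is not something this sketch guarantees.

```cpp
#include <cstddef>

// Apply the attribute only where it is known to be understood (GCC proper).
#if defined(__GNUC__) && !defined(__clang__)
#define FAST_MATH_FN __attribute__((optimize("-ffast-math")))
#else
#define FAST_MATH_FN
#endif

// Compiled with the translation unit's normal floating-point settings.
float sum_precise(const float* data, std::size_t length) {
    float total = 0.0f;
    for (std::size_t i = 0; i < length; ++i) {
        total += data[i];
    }
    return total;
}

// Same loop, but GCC is asked to compile this one function with fast-math.
FAST_MATH_FN float sum_fast(const float* data, std::size_t length) {
    float total = 0.0f;
    for (std::size_t i = 0; i < length; ++i) {
        total += data[i];
    }
    return total;
}

// Both instantiations live in the same translation unit and dispatch
// to the matching helper at compile time.
template <bool UsesFastMath>
float foo(const float* data, std::size_t length) {
    return UsesFastMath ? sum_fast(data, length) : sum_precise(data, length);
}
```

Keeping the fast-math code in a separate non-template helper keeps the attribute on an ordinary function definition, which is the form shown in the answer above.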
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.

iar ewarm linking to gcc eabi build library

I have been able to build code in IAR EWARM (7.40) (for the ST STM32F407IG ARM Cortex-m4) which links to a library built under Ubuntu via gcc (4.9.3). This mostly works but some build environment adjustments on either or both the IAR or gcc side still remain. I would appreciate whatever help you can point me to.
There are no build errors evident, but EWARM and arm-none-eabi-gcc disagree on the locations of parameters being passed to the gcc-built library. The EWARM debugger and the code generated by EWARM agree with each other, but, from the investigation so far, the locations expected by the gcc-generated code appear to be offset by eight bytes from those expected by EWARM. I've only investigated a single call, so this offset may not be constant...
IAR's compiler flags include: --aeabi and --guard_calls as per section: "AEABI compliance" in the EWARM help section.
arm-none-eabi-gcc compiler flags include: -gdwarf-3 -mabi=aapcs -march=armv7e-m -mthumb.
I believe this tells both EWARM and gcc to play nice together with ARM AAPCS standard procedure calls and dwarf v3 formats.
EWARM does seem to be happy with either -gdwarf-2 or -gdwarf-3 (but not -4). This selection does not appear to affect the issue discussed above.
What else is required?
The answer to "What else is required?" appears to be nothing. Just be darn sure that all of the macros evaluated by #ifdef statements match in the two environments, so you don't end up with differently sized data structures! #ifdef code in header files should be carefully evaluated...
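To illustrate how a mismatched macro changes a structure shared by two toolchains, here is a hypothetical sketch. The struct, its fields, and the macro `ENABLE_EXTRA_FIELD` are invented for illustration and do not come from the original project. If one build defines the macro and the other does not, every field after the conditional member moves, which shows up as an offset error at the call boundary. A `static_assert` in the shared header is one way to turn such a mismatch into a build error.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical structure passed between the application and the library.
struct DeviceConfig {
    std::uint32_t id;
#ifdef ENABLE_EXTRA_FIELD          // hypothetical macro; must match in both builds
    std::uint32_t extra;
#endif
    std::uint32_t flags;
};

// Size both builds have agreed on (here: the layout WITHOUT the extra field).
constexpr std::size_t kAgreedSize = 2 * sizeof(std::uint32_t);

// Fails to compile in any build where the macro setting changes the layout.
static_assert(sizeof(DeviceConfig) == kAgreedSize,
              "DeviceConfig layout differs: check #ifdef macros in both build environments");

// Offset of the field that would move if the macro were defined in one build only.
constexpr std::size_t flags_offset() {
    return offsetof(DeviceConfig, flags);
}
```

Building this header with `ENABLE_EXTRA_FIELD` defined trips the assertion instead of silently shifting `flags` by four bytes.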

What does '-Olimit 2000' mean for cc

I am trying to compile an old program (which was compiled by cc) using gcc. In the makefile there is one line like this:
CFLAGS = -O2 -Olimit 2000 -w
There is no '-Olimit 2000' in gcc. I am wondering what it really means, and whether it is safe to just delete this option when using gcc.
As far as I can tell, this was only supported by IRIX's C compiler. I can't even find a solid reference as to what it was used for. Since it doesn't do anything with GCC, it's definitely safe to remove it.
A little more detail: it was used to disable optimization on routines that were larger than the "Olimit". The limit exists to bound the amount of time spent on optimization. If you specify 0 for the Olimit, it means an "infinite Olimit" and every routine will be optimized. Here's a man page for MIPSpro: http://cimss.ssec.wisc.edu/~gumley/modis/old/mips_64.pdf

Resources