Measure program execution time in nanoseconds

I am trying to measure the execution time of my program in nanoseconds, using g++ 4.2.2 in Visual Studio, but the #include <chrono> directive is not recognized by the compiler. I am not allowed to compile my program with any compiler other than g++ 4.2.2.
Are there any other options I can use to measure the start and end time of my program in nanoseconds?
This is what I am doing:
#include <iostream>
#include <chrono>
using namespace std;

int main() {
    auto start = chrono::high_resolution_clock::now();
    // ..... my program .....
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    return 0;
}

The <chrono> library is only available starting with C++11, and your compiler may be too old to use it. This link indicates that g++ 4.3 was the earliest version of g++ to incorporate any C++11 features:
https://gcc.gnu.org/projects/cxx-status.html#cxx11
You should look at Boost. It most likely has something you can use (e.g., boost::posix_time::nanoseconds).
http://www.boost.org/
http://www.boost.org/doc/libs/1_61_0/doc/html/date_time/posix_time.html#date_time.posix_time.time_duration
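A minimal sketch of the Boost approach suggested above, for compilers without <chrono> (assuming Boost is installed; note that total_nanoseconds() reports nanoseconds, but the underlying microsec_clock typically has only microsecond resolution):
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>

int main() {
    boost::posix_time::ptime start =
        boost::posix_time::microsec_clock::universal_time();

    // ..... my program .....

    boost::posix_time::ptime end =
        boost::posix_time::microsec_clock::universal_time();

    // Convert the elapsed time_duration to a nanosecond count.
    boost::posix_time::time_duration elapsed = end - start;
    std::cout << elapsed.total_nanoseconds() << std::endl;
    return 0;
}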

Related

ALSA external plugin and openmp in C++

I'm creating an ALSA external plugin using the gtkIOStream ALSAExternalPlugin class.
In my external plugin code, I am making the necessary OpenMP calls:
omp_set_num_threads(omp_get_max_threads());
printf("omp_get_num_threads()=%d\n", omp_get_num_threads());
I am also compiling with the necessary OpenMP flags and libraries (-fopenmp and -lgomp).
However, when I run my code using "aplay -DexternalPlugin file", the system reports only one thread in use instead of 20 threads.
Am I missing something?
The linking flags for compiling the external plugin are as follows:
-fopenmp -lgomp -module -avoid-version -export-dynamic -no-undefined
-fopenmp is also in the CPP flags and I can see it at compile time.
Setting the number of threads does not make your code go parallel. As written, you are setting the number of threads that will be used by the next parallel region, and then printing the number of threads currently in use, which will indeed be one, since you haven't entered a parallel region.
In general, there is no point in forcing the number of threads, since any sane OpenMP runtime (certainly GCC and LLVM) will use all of the available threads by default.
Just print omp_get_max_threads() to see what will be used.
Of course, looking at machine load externally when running your code is also a way to check this!
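A minimal sketch of that point (compile with -fopenmp): omp_get_num_threads() only reports more than one thread when queried inside a parallel region.
#include <cstdio>
#include <omp.h>

int main() {
    // Outside any parallel region only the initial thread exists.
    std::printf("outside: %d thread(s)\n", omp_get_num_threads());   // prints 1

    #pragma omp parallel
    {
        // Inside the region the whole team exists; print the count once.
        #pragma omp single
        std::printf("inside: %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}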

AVX512 and MSVC preprocessor symbol

According to this link, there are no predefined preprocessor symbols for AVX512 (MSVC 2017).
I'm trying to build thundersvm, which uses the Eigen library, on (you guessed it) Windows. Both Eigen and thundersvm use cmake, and depending on the compiler preprocessor symbols, Eigen compiles with AVX512 instructions or not.
It seems that using /arch:AVX512 doesn't trigger any errors in MSVC but doesn't define the __AVX512F__ symbol which Eigen needs. I also tried to include -D__AVX512F__=ON in the cmake arguments, but still no luck.
Since there is no predefined preprocessor symbol for AVX512, is there any way to force Eigen to compile with AVX512?
Update
Following chtz's comment, I checked out the default branch of Eigen and recompiled thundersvm with /arch:AVX512 and these cmake arguments (maybe not all are needed):
-DUSE_CUDA=OFF -DUSE_EIGEN=ON -DBUILD_SHARED_LIBS=OFF -DEIGEN_ENABLE_AVX512=ON -D__AVX512F__=ON -DEIGEN_VECTORIZE_AVX512=ON -DEIGEN_VECTORIZE_AVX2=ON -DEIGEN_VECTORIZE_AVX=ON -DEIGEN_VECTORIZE_FMA=ON
Comparing the instruction mix from Intel's SDE -mix tool before and after the patch, I can clearly see that AVX-512 instructions are used (SDE complains that it doesn't recognise the instruction vbroadcastss zmm0, xmm0 when run for an skl CPU, but works fine for skx). The problem is that MSVC uses the scalar versions of the instructions and there is no improvement in the runtime (also the total number of instructions is the same), which is similar to this post.
Are there other flags I need to define so that MSVC generates non-scalar instructions? (I think I'll also give gcc a try.)
MSVC has poor support for AVX-512 and makes no distinction between the different subsets. There is no safe way to produce AVX512F code on MSVC without also possibly emitting AVX512DQ instructions.
The best compilers for AVX-512 are gcc and clang. There is a Clang plugin to Visual Studio that you can use if you like the IDE. The gcc and clang compilers have preprocessor symbols like __AVX512F__, __AVX512VL__, etc.
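As a rough illustration (macro availability depends on the compiler version, so treat this as a sketch): the macros Eigen keys on are predefined by GCC/Clang under -mavx512f, while MSVC 2017 stops at __AVX2__ even under /arch:AVX512.
#include <iostream>

int main() {
#if defined(__AVX512F__)
    // GCC/Clang define this under -mavx512f; Eigen's AVX-512 path needs it.
    std::cout << "__AVX512F__ is defined\n";
#elif defined(_MSC_VER) && defined(__AVX2__)
    // MSVC 2017 with /arch:AVX512 still only predefines __AVX__/__AVX2__.
    std::cout << "MSVC: no __AVX512F__, only __AVX2__ and below\n";
#else
    std::cout << "no AVX-512 foundation macro detected\n";
#endif
    return 0;
}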

PowerPC GCC floating point instructions

Currently I am developing for an MPC5777 board with e200z7 cores. Most things are going well, but I am stuck on a problem that is really annoying me.
I am trying to use floating point operations on portions of my code, using the embedded hardware support. My toolchain is GCC 6.3 (powerpc-gcc), for which I am using the following flags:
ASFLAGS_BASE = -g -a32 -mbooke -me500 --fatal-warnings
ARCH_FLAGS = -mpowerpc-gpopt -mfprnd -misel -m32 -mhard-float -mabi=spe -mmfpgpr -mfloat-gprs=single
Please notice the -mfloat-gprs=single flag. That is the one that is giving problems.
When I use -mfloat-gprs=single, I am not able to build properly, as some support functions are not implemented:
undefined reference to `__extendsfdf2`,
undefined reference to `__adddf3`,
undefined reference to `__divdf3`,
among others.
Now, if I compile using -mfloat-gprs=double, the build runs to completion and generates all my executable files. BUT using this flag also generates instructions that are not implemented by the e200z7. I can't tell for sure which ones, as the code is getting bigger and it is mostly impossible to track all the generated assembly. For instance, at the moment my execution gets stuck when it reaches the efscfd instruction, which is implemented by the e500 core (which has double-precision floating-point support) but not by the e200, which has single-precision support only.
So, any piece of advice here would be amazingly welcome!
Thanks in advance,
In case someone ends up here with a similar problem, I have fixed it by using three flags:
-mfloat-gprs=single -Wdouble-promotion -fsingle-precision-constant
What they do is:
-mfloat-gprs=single tells the compiler to use general purpose registers for floating point operations, instead of floating point registers; =single means single precision only.
-Wdouble-promotion enables a compiler warning whenever GCC promotes a single-precision float to a double.
-fsingle-precision-constant tells GCC to treat unsuffixed floating-point constants as single precision instead of implicitly converting them to double.
I am using -fsingle-precision-constant and -Wdouble-promotion at the same time to be 100% sure that double precision will not be used.
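As a small illustration of what those flags guard against (a sketch, not from the original code): an unsuffixed constant is a double, so the arithmetic is promoted to double and pulls in soft-double helpers such as __extendsfdf2 on a single-precision-only core.
// Compile with -mfloat-gprs=single -Wdouble-promotion -fsingle-precision-constant
float scale_promoted(float x) {
    return x * 0.5;    // 0.5 is a double: x is promoted, -Wdouble-promotion warns
}

float scale_single(float x) {
    return x * 0.5f;   // stays entirely in single precision
}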

C++ Compilation flags for an R package in Windows/Mac

I developed an R package which calls C++ code through Rcpp and RcppEigen. My Makevars.win looks like this (the enumeration is there so that my questions below can refer to it)
1. CXX_STD = CXX11
2. PKG_CPPFLAGS = -fopenmp -O3 -Wall -ftree-vectorize -march=native -mavx -mfma
3. PKG_CXXFLAGS += $(SHLIB_OPENMP_CXXFLAGS)
4. PKG_LIBS = -fopenmp
5. PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS)
6. PKG_CPPFLAGS += -I../inst/include/
as I want to use OpenMP and link the R package against the Intel MKL library. I am also adding the plugins // [[Rcpp::plugins(cpp11)]] and // [[Rcpp::plugins(openmp)]] in my source files.
When I compile the package everything works fine, but I am still getting the default compilation flags -O2 and -std=c++0x. So my questions are:
A. Isn't 1. supposed to force -std=c++11? (By the way, the same Makevars yields the right C++ version on other platforms, so there must be something specific to Windows.)
B. Does 3. repeat the -fopenmp already given in 2.?
C. How can I check whether 5. has been taken into account? I am asking this because the same package built on Mac is much faster than on Windows, while their configurations are the same. I have done some benchmarking of the same code on Windows using Microsoft R Open and on Mac, and Windows was faster in that case.
Thank you very much for your very precious help.
Where to start?
First off, compilation and linking options are based on the union of R's Makeconf and your package's src/Makevars. You can add to the values; you cannot replace them.
Second, and related, which BLAS you get is a system setup issue. You cannot generally govern that from your package.
Third, plugins are for sourceCpp() and cppFunction(). In packages you make direct declarations, i.e. CXX_STD = CXX11.
Fourth, there are almost 1000 packages on CRAN using Rcpp. Sometimes it helps just to look at what some of these do. Many employ OpenMP.
Fifth, OpenMP is severely challenging on OS X thanks to Apple. I've forgotten what the Windows situation is. It just works on Linux.
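Regarding question C, one way to see what the compiler actually applied is to export a small reporting function from the package. This is only a sketch (the function name buildInfo is made up for illustration): it reports the C++ standard the translation unit was compiled with and whether OpenMP was really enabled.
#include <Rcpp.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// [[Rcpp::export]]
Rcpp::List buildInfo() {
    int omp_threads = 1;
#ifdef _OPENMP
    omp_threads = omp_get_max_threads();   // only set if -fopenmp really applied
#endif
    return Rcpp::List::create(
        Rcpp::Named("cpp_standard")   = (double)__cplusplus,  // 201103 for C++11
        Rcpp::Named("openmp_threads") = omp_threads);
}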

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to tell popular compilers (*) to apply, or not apply, -ffast-math to individual functions, so that I can have both instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) by popular compilers I mean any of gcc, clang, msvc, icc, or nvcc (for GPU kernel code), about which you have that information.
In GCC you can declare functions like the following:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
    return val / 2;
}
This is a GCC-only feature.
See working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
It seems that GCC does not verify the arguments to optimize(), so typos like "-ffast-match" will be silently ignored.
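One way to combine this attribute with the template from the question, keeping both instantiations in one translation unit, is to route each instantiation through a helper that carries the attribute. A GCC-only sketch (the helper names are made up for illustration):
#include <cstddef>

__attribute__((optimize("-ffast-math")))
static void foo_fast(float* data, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i)
        data[i] = data[i] / 3.0f + 1.0f;   // body compiled with -ffast-math
}

static void foo_strict(float* data, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i)
        data[i] = data[i] / 3.0f + 1.0f;   // body uses the TU's normal settings
}

template <bool UsesFastMath>
void foo(float* data, std::size_t length) {
    if (UsesFastMath)
        foo_fast(data, length);
    else
        foo_strict(data, length);
}

// Explicit instantiations of both variants in the same translation unit.
template void foo<true>(float*, std::size_t);
template void foo<false>(float*, std::size_t);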
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision, intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.
