Eigen linear algebra library is too slow

I have a positive definite matrix and I want to use a Cholesky factorization to solve it. With Eigen I do this:
VectorXd x1 = A1.ldlt().solve(B1); // 35.7 ms
Over 100 runs, the average is 35.7 ms. Is this normal? My own code can execute in 10 ms, on Red Hat with an Intel i7-3770 @ 3.4 GHz.
How can I speed it up? The matrix is 241x241 and the vector has 241 elements.

Make sure you have enabled compiler optimizations. For instance, for gcc or clang you might use:
-O2 -msse2 -DNDEBUG
For a matrix of this size, Eigen takes about 0.8 ms on a Core i7 @ 2.6 GHz.
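For reference, a minimal benchmark sketch of the LDLT solve (the matrix setup here is illustrative, not the original data):

#include <Eigen/Dense>
#include <chrono>
#include <iostream>
using namespace Eigen;

int main() {
    const int n = 241;
    // Build a random symmetric positive definite system for testing
    MatrixXd M = MatrixXd::Random(n, n);
    MatrixXd A1 = M * M.transpose() + double(n) * MatrixXd::Identity(n, n);
    VectorXd B1 = VectorXd::Random(n);

    auto t0 = std::chrono::high_resolution_clock::now();
    VectorXd x1;
    for (int i = 0; i < 100; ++i)
        x1 = A1.ldlt().solve(B1);
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "avg ms: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count() / 100
              << '\n';
}

Compile with the optimization flags above; without -DNDEBUG, Eigen's assertions alone can dominate the runtime.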

Related

MKL FFT performance on Intel Xeon 6248 - abrupt variations

I am working on an application which requires Fourier transforming batches of 2-dimensional signals, stored as single-precision complex floats.
I wanted to test the idea of dissecting those signals into smaller ones to see whether I could improve the efficiency of my computation, given that the FLOP count of an FFT grows as O(N log N). Of course, different signal sizes (in memory) may achieve different FLOP/s rates, so to really see whether this idea can work I ran some experiments.
What I observed was that performance varied very abruptly when changing the signal size, jumping for example from 60 Gflops/s to 300 Gflops/s! I am wondering why that is the case.
I ran the experiments using:
Compiler: g++ 9.3.0 ( -Ofast )
Intel MKL 2020 (static linking)
MKL-threading: GNU
OpenMP environment:
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=20
Platform:
Intel Xeon Gold 6248
https://ark.intel.com/content/www/us/en/ark/products/192446/intel-xeon-gold-6248-processor-27-5m-cache-2-50-ghz.html
Profiling tool:
Score-P 6.0
Performance results:
To estimate the average FLOP rates I assume: # of flops = Nbatch * 5 * N*N * log2(N*N)
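As a sanity check, this estimate can be expressed as a small helper (a sketch; the batch count and timing in the example are illustrative, not the actual measurements):

#include <cmath>
#include <cstdio>

// Gflop/s estimate for a batch of NxN 2D FFTs, using the model above:
//   flops = Nbatch * 5 * N*N * log2(N*N)
double fft2d_gflops(int nbatch, int n, double seconds) {
    double flops = (double)nbatch * 5.0 * n * n * std::log2((double)n * n);
    return flops / seconds / 1e9;
}

int main() {
    // Illustrative numbers only: 1000 transforms of 201x201 in 28 ms
    std::printf("%.1f Gflop/s\n", fft2d_gflops(1000, 201, 0.028));
}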
When using batches of 2D signals of size 201 x 201 elements (N = 201), the observed average performance was approximately 72 Gflops/s.
Then I examined the performance using 2D signals with N = 101, 102, 103, 104, or 105. The results are shown in the figure below.
I also ran experiments with smaller sizes, such as N = 51, 52, 53, 54, or 55. The results are again shown below.
And finally, for N = 26, 27, 28, 29, or 30.
I performed the experiments twice and the performance results are consistent, so I really doubt it is noise... but then again, achieving 350 Gflops/s feels quite unrealistic, or maybe not?
Has anyone experienced similar performance variations, or have some comments on this?
You can use an FFT from either the Intel MKL or the Intel IPP (Intel® Integrated Performance Primitives) library. As mentioned earlier in the comments section, the article linked below helps determine which library is best suited for your application.
If you are working on engineering, scientific, or financial applications, you can go with Intel MKL; if you are working on imaging, vision, signal, security, or storage applications, Intel IPP is the better fit for performance.
Intel® MKL is suitable for the large problem sizes typical of the FORTRAN and C/C++ high-performance computing applications mentioned above.
Intel® IPP is specifically designed for smaller problem sizes, including those used in multimedia, data processing, communications, and embedded C/C++ applications.
For complete details, please refer to:
https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-ipp-choosing-an-fft.html
https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top.html
https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top.html

Xcode Apple Clang enable avx512

In Xcode (Version 10.1 (10B61)), I used the macros below to detect AVX512 support.
#ifdef __SSE4_1__
#error "sse4_1"
#endif
#ifdef __AVX__
#error "avx"
#endif
#ifdef __AVX2__
#error "avx2"
#endif
#ifdef __AVX512__
#error "avx512"
#endif
With the default Build Settings, SSE4_1 is active, but AVX and AVX2 are not. When I add -mavx under Build Settings --> Apple Clang - Custom Compiler Flags --> Other C Flags, that enables AVX; further adding -mavx2 enables AVX and AVX2, but -mavx512 fails with Unknown argument: '-mavx512'.
How do you enable AVX512 and detect it?
It seems there are a few macros to detect AVX512:
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
What are the differences between them?
AVX512 isn't a single extension, and doesn't have a specific-enough meaning in this context to be useful. Compilers only deal with specific CPU features, like AVX512F, AVX512DQ, AVX512CD, etc.
All CPUs that support any AVX512 extensions must support AVX512F, the "Foundation". AVX512F is the baseline AVX512 extension that other AVX512 extensions build on.
In code that wants to use AVX512 intrinsics, you should look at https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 and pick a set of extensions that are available together on one CPU you care about, e.g. F + CD and VL, DQ, BW on currently-available Skylake-X.
Then, for example, use #if defined(__AVX512BW__) && defined(__AVX512VL__) before code that uses vpermt2w on 256-bit vectors, or something similar. __AVX512(anything)__ implies __AVX512F__; that's the one extension you don't have to check for separately.
But if you only use AVX512F instructions, then yeah, just check for that macro.
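A minimal sketch of such guards (the function names here are illustrative, not from the original code):

#include <immintrin.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
// vpermt2w on 256-bit vectors requires both AVX512BW and AVX512VL
static __m256i blend_words(__m256i a, __m256i idx, __m256i b) {
    return _mm256_permutex2var_epi16(a, idx, b);
}
#endif

#ifdef __AVX512F__
// A plain 512-bit float add needs only the Foundation extension
static __m512 add_ps512(__m512 a, __m512 b) {
    return _mm512_add_ps(a, b);
}
#endif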
You should pretty much never use -mavx512f directly: use -march=skylake-avx512, -march=knl, or -march=native. Or in future, -march=icelake or whatever.
The compiler knows which CPUs support which sets of extensions (or can detect which extensions the machine you're compiling on supports). There are a lot of them, and leaving out important ones like AVX512VL (support for AVX512 instructions on 128-bit and 256-bit vectors) or Xeon Phi's AVX512ER (fast 1/x and 1/sqrt(x) with twice the precision of the normal AVX512 14-bit versions) could hurt performance significantly. Especially AVX512ER is very important if you do any division or log/exp on Xeon Phi, because full-precision division is very slow on KNL compared to Skylake.
-march=x implies -mtune=x, enabling tuning options relevant for the target as well. KNL is basically Silvermont with AVX512 bolted on, and has significant differences from -mtune=skylake-avx512.
These are the same reasons you should generally not use -mfma -mavx2 directly, except that there are currently no AMD CPUs with AVX512, so there are only 2 main tuning targets (Xeon Phi and mainstream Skylake/CannonLake/Icelake), and they also support different sets of AVX512 extensions. There is unfortunately no -mtune=generic-avx2 tuning setting, but Ryzen supports almost all extensions that Haswell does (and the ones it doesn't, like transactional memory, GCC/clang won't use automatically), so -march=haswell might be reasonable to make code tuned for CPUs with FMA, AVX2, popcnt, etc., without suffering too much on Ryzen.
Also relevant (for GCC, maybe not clang currently; see https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html):
-mprefer-vector-width=256: auto-vectorize with 256-bit vectors by default, in case most of the time is spent in non-vectorized loops. Using 512-bit vectors reduces the max turbo clock speed by a significant amount on Intel Xeon CPUs (maybe not as much on i9 desktop versions of Skylake-X), so it can be a net slowdown to use 512-bit vectors in small scattered bits of your program. So 256 is the default for -mtune=skylake-avx512 in GCC, but KNL uses 512.
-mprefer-avx-128: the old version of the -mprefer-vector-width= option, from before AVX512 existed.
Using AVX512 mask registers, 32 vector registers, and/or its new instructions can be a significant win even at the same vector width, so it makes sense to enable AVX512 even if you don't want to use 512-bit vectors. (Although sometimes code using intrinsics or auto-vectorization will compile in a worse way, instead of better, if AVX512 compare-into-register versions of comparisons are available at all. But hopefully anti-optimization bugs like that will be sorted out as AVX512 becomes more widely used.)

Accelerator restriction: unsupported operation: RSQRTSS

I have a simple nbody implementation and am trying to compile it to run on NVIDIA GPUs (Tesla K20m / GeForce GTX 650 Ti). I use the following compiler options:
-Minfo=all -acc -Minline -Mfpapprox -ta=tesla:cc35/nvidia
Everything works without -Mfpapprox, but when I use it, compilation fails with the following output:
346, Accelerator restriction: unsupported operation: RSQRTSS
Line 346 reads:
float rdistance=1.0f/sqrtf(drSquared);
where
float drSquared=dx*dx+dy*dy+dz*dz+softening;
and dx, dy, dz are float values. This line is inside a #pragma acc parallel loop independent for() construct.
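For context, the surrounding loop is roughly like this (a reconstruction sketch based on the snippets above; the array layout, bounds, and function name are assumed):

#include <math.h>

// Sketch of an nbody force loop (names and structure assumed)
void body_forces(int n, const float *x, const float *y, const float *z,
                 float *fx, float *fy, float *fz, float softening)
{
    #pragma acc parallel loop independent
    for (int i = 0; i < n; i++) {
        float fxi = 0.0f, fyi = 0.0f, fzi = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float drSquared = dx*dx + dy*dy + dz*dz + softening;
            float rdistance = 1.0f / sqrtf(drSquared); // the failing line 346
            float rd3 = rdistance * rdistance * rdistance;
            fxi += dx * rd3;
            fyi += dy * rd3;
            fzi += dz * rd3;
        }
        fx[i] = fxi; fy[i] = fyi; fz[i] = fzi;
    }
}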
What is the problem with -Mfpapprox?
-Mfpapprox tells the compiler to use very low-precision CPU instructions (such as RSQRTSS) to approximate DIV or SQRT. These instructions are not supported on the GPU. The GPU SQRT is both fast and precise, so there is no need for a low-precision version.
Actually, even on the CPU, I'd recommend you not use -Mfpapprox unless you really understand the mathematics of your code and it can handle a high degree of imprecision (as much as 5-6 bits, or ~20 ULPs off). We added this flag about 10 years ago since at the time the CPU's divide operation was very expensive. However, CPU divide performance has greatly improved since then (as has sqrt), so you're generally better off not sacrificing precision for the little bit of speed-up you might get from this flag.
I'll put in an issue report requesting that the compiler ignore -Mfpapprox for GPU code so you won't see this error.

Memory error when using OpenMP with Fortran, running FFTW

I am testing FFTW in a Fortran program because I need to use it. Since I am working with huge matrices, my first solution is to use OpenMP. When my matrix has dimension 500 x 500 x 500, the following error occurs:
Operating system error:
Program aborted. Backtrace:
Cannot allocate memory
Allocation would exceed memory limit
I compiled the code using the following: gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp
PROGRAM test_fftw
USE omp_lib
USE, intrinsic:: iso_c_binding
IMPLICIT NONE
INCLUDE 'fftw3.f'
INTEGER::i, DD=500
DOUBLE COMPLEX:: OUTPUT_FFTW(3,3,3)
DOUBLE COMPLEX, ALLOCATABLE:: A3D(:,:,:), FINAL_OUTPUT(:,:,:)
integer*8:: plan
integer::iret, nthreads
INTEGER:: indiceX, indiceY, indiceZ, window=2
!! TESTING 3D FFTW with OPENMP
ALLOCATE(A3D(DD,DD,DD))
ALLOCATE(FINAL_OUTPUT(DD-2,DD-2,DD-2))
write(*,*) '---------------'
write(*,*) '------------TEST 3D FFTW WITH OPENMP----------'
A3D = reshape((/(i, i=1,DD*DD*DD)/),shape(A3D))
CALL dfftw_init_threads(iret)
CALL dfftw_plan_with_nthreads(nthreads)
CALL dfftw_plan_dft_3d(plan, 3,3,3, OUTPUT_FFTW, OUTPUT_FFTW, FFTW_FORWARD, FFTW_ESTIMATE)
FINAL_OUTPUT=0.
!$OMP PARALLEL DO DEFAULT(SHARED) SHARED(A3D,plan,window) &
!$OMP PRIVATE(indiceX, indiceY, indiceZ, OUTPUT_FFTW, FINAL_OUTPUT)
DO indiceZ=1,10!500-window
write(*,*) 'INDICE Z=', indiceZ
DO indiceY=1,10!500-window
DO indiceX=1,10!500-window
CALL dfftw_execute_dft(plan, A3D(indiceX:indiceX+window,indiceY:indiceY+window, indiceZ:indiceZ+window), OUTPUT_FFTW)
FINAL_OUTPUT(indiceX,indiceY,indiceZ)=SUM(ABS(OUTPUT_FFTW))
ENDDO
ENDDO
ENDDO
!$OMP END PARALLEL DO
call dfftw_destroy_plan(plan)
CALL dfftw_cleanup_threads()
DEALLOCATE(A3D,FINAL_OUTPUT)
END PROGRAM test_fftw
Notice that this error occurs when I merely allocate the huge matrix (A3D), without even running the loop over all of its values (to run over all values, the limits of the three nested loops should be 500-window).
I tried to solve this (tips here and here) with -mcmodel=medium in the compilation, without success.
I had success when I compiled with gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp -fmax-stack-var-size=65536
So, I don't understand:
1) Why is there a memory allocation problem if the huge matrix is a shared variable?
2) Will the solution I found still work if I have more huge matrix variables? For example, 3 more matrices of 500 x 500 x 500 to store calculation results.
3) In the tips I found, people said that using allocatable arrays/matrices would solve it, but I was already using them, without any difference. Is there anything else I need to do?
Two double complex arrays with 500 x 500 x 500 elements require 4 gigabytes of memory: each array holds 500^3 = 125,000,000 elements at 16 bytes per double complex, i.e. about 2 GB per array. It is likely that the amount of available memory in your computer is not sufficient.
If you only work with small windows, you might consider not keeping the whole array in memory at once, but only parts of it. Or distribute the computation across multiple computers using MPI.
Or just use a computer with bigger RAM.

Which -mfpu [neon/vfp/vfpv3] should I specify when evaluating and comparing float performance on ARM processors?

I want to evaluate the float performance of several different ARM processors. I use lmbench and pi_css5, but I am confused about the float test.
From cat /proc/cpuinfo (below), I guess there are 3 types of float features: neon, vfp, and vfpv3. From this question & answer, it seems to depend on the compiler.
Still, I don't know which to specify in the compile flag (-mfpu=neon/vfp/vfpv3), whether I should compile the program with each of them in turn, or whether I should just not specify -mfpu at all.
cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 532.00
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 4
It might be even a little more complicated than you anticipated. GCC's ARM options page doesn't explain fpu versions; however, ARM's manual for their compiler does. You should also note that Linux doesn't tell the whole story about fpu features, reporting only vfp, vfpv3, vfpv3d16, or vfpv4.
Back to your question: you should select the greatest common factor among them, compile your code for it, and compare the results. On the other hand, if one CPU has vfpv4 and another has vfpv3, which one would you think is better?
If your question is simply about selecting between neon, vfp, or vfpv3: select neon (source).
-mfpu=neon selects VFPv3 with NEON coprocessor extensions.
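With -mfpu=neon enabled, NEON intrinsics become available in C/C++; a minimal sketch (the function name is illustrative):

#include <arm_neon.h>   // requires compiling with -mfpu=neon

// Add two arrays of 4 floats with a single 128-bit NEON operation
void add4(const float *a, const float *b, float *out) {
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
}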
From the gcc manual,
If the selected floating-point hardware includes the NEON extension
(e.g. -mfpu=neon), note that floating-point operations will
not be used by GCC's auto-vectorization pass unless
`-funsafe-math-optimizations' is also specified. This is because
NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are
treated as zero), so the use of NEON instructions may lead to a
loss of precision.
See for instance, Subnormal IEEE-754 floating point numbers support on ios... for more on this topic.
I have tried each of them, and it seems that using -mfpu=neon together with -march=armv7-a and -mfloat-abi=softfp is the proper configuration.
Besides, a reference (ARM Cortex-A8 vs. Intel Atom) is very useful for ARM benchmarking.
Another helpful article covers ARM Cortex-A processors and gcc command lines, which clarifies the SIMD coprocessor configuration.
