Eigen + MKL uses single core for complex matrix (ZHEEV) - openmp

There is a weird behavior from MKL on our cluster. I am calling Eigen::SelfAdjointEigenSolver<Eigen::MatrixXcd> for a complex matrix (ZHEEV).
When I calculate the eigenvectors for large matrices (dim >~ 100k) it only uses a single core.
Strangely, it runs perfectly fine (multiple cores) for smaller complex matrices, real matrices and large complex matrices (dim >~ 100k) without eigenvectors.
Has anyone faced the same issue, or does anyone have an idea of what is going on in the background?
I have tried various MKL versions.

This issue is well known with OpenBLAS and Netlib's reference LAPACK, and is due to poor zlasr performance (ref). If you can, switch to zheevr or use Intel's MKL (it seems to take another route to avoid zlasr). MKL definitely does not exhibit this issue. (You didn't mention which versions of MKL you've tried.)

Related

Eigendecompositions are 5 times slower in Julia than in Mathematica?

I am new to Julia and primarily work in Mathematica, so I probably have a few elementary mistakes floating around. I attempted to time how long Julia took to compute the eigensystem of a random matrix, and found it was 5-6 times slower than in Mathematica.
In Julia:
D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);
Out: elapsed time: 7.47950706 seconds (79638920 bytes allocated*)
In Mathematica:
First@Timing@Eigensystem[RandomReal[{-500, 500}, {1000, 1000}]]
Out: 1.310408
For 2000 x 2000 arrays it's similar, although the Julia result slowed down slightly less than the equivalent Mathematica call, but it's still slower; Julia takes 22 seconds, whereas Mathematica computes it in 8 seconds.
As far as I read in the Julia standard library for linear algebra, decompositions are implemented by calling LAPACK, which I thought was supposed to be very good, so I'm confused as to why the Julia code is running so much slower. Does anyone know why this is the case? Is it doing some kind of balancing or array-symmetry-detection that Mathematica doesn't do? Or is it actually slower?
Also, this is a syntax question and probably a silly error, but how do you change the balancing in Julia? I tried
@time (E,F)=eig(D[, balance=:nobalance]);
exactly as copied and pasted from the Julia manual, but it just gave a syntax error, so something's wrong.
I am using Windows 7 64-bit, with Julia version 0.2.0 64-bit, installed using the instructions at Steven Johnson's site, with Anaconda installed first to take care of prerequisites. I am using Mathematica student edition version 9.0.1.
EDIT 1:
Executing versioninfo() yielded
Julia Version 0.2.0
Commit 05c6461 (2013-11-16 23:44 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
So it looks like I'm using OpenBLAS for both LAPACK and BLAS. Once I get the Mathematica implementation info I will add that as well.
EDIT 2:
It appears that Windows Mathematica probably uses Intel MKL BLAS.
The eigen calculation in Julia is outsourced to LAPACK and BLAS, and I think it is also the case for Mathematica. Julia can use different versions of BLAS and LAPACK and you are therefore effectively comparing your choice of LAPACK and BLAS for Julia with Mathematica's LAPACK and BLAS (probably Intel MKL).
The default choice for Julia is OpenBLAS which is fast on most architectures and on my machine Julia is faster than Mathematica for the eigen calculation. If you are on Linux and have chosen BLAS and LAPACK from a repo, it is very likely that they are much slower than OpenBLAS.
The option for balancing has recently been added to Julia and mistakenly the option was not added to the function eig, which is only a MATLAB compatible interface to the eigfact function. Writing eigfact(A,balance=:nobalance) should work.
Edit 1:
Further investigation has shown that the difference is due to a threading problem in OpenBLAS on Windows. If Julia's BLAS is restricted to one thread the timings are comparable to Mathematica, but if more threads are allowed the calculation slows down. This doesn't seem to be a problem on Mac or Linux, but as mentioned above, in general the performance of OpenBLAS depends on the architecture.
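One way to test the single-thread hypothesis without touching the program itself is to pin the thread count through environment variables before the BLAS library is loaded. The sketch below is only an illustration: OPENBLAS_NUM_THREADS, MKL_NUM_THREADS and OMP_NUM_THREADS are the variables those runtimes read, but the helper name blas_thread_env and the way it is used are assumptions, not part of Julia or OpenBLAS.

```python
import os

# Environment variables read by OpenBLAS, Intel MKL and OpenMP runtimes.
THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def blas_thread_env(n_threads, base_env=None):
    """Return a copy of the environment with the BLAS thread count pinned.

    Hypothetical helper: the returned dict is meant to be passed as env=
    to subprocess.run when launching the program that loads the BLAS.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in THREAD_VARS:
        env[name] = str(n_threads)
    return env


# Build an environment restricted to a single BLAS thread.
env = blas_thread_env(1, base_env={"PATH": "/usr/bin"})
print(env["OPENBLAS_NUM_THREADS"])
```

The variables have to be set before the library initializes, which is why the sketch builds an environment for a child process instead of changing the current one.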
Edit 2:
Recently, the balancing option has changed. Balancing can be switched off by writing eigfact(A,permute=false,scale=false).
There's definitely at least one thing wrong, and of course the good news is that it's likely to be fixable. For this kind of thing, you're much better off filing an issue on GitHub. I've done that for you here.
Regarding the speed, you may want to check that the accuracy of the eigendecomposition is comparable. And of course, accuracy sometimes depends on the condition number of the matrix; it's possible that Julia is using a more careful algorithm, and your example may or may not reveal that if its condition number is not a problem. This is definitely something to discuss over at the issue.
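A simple way to compare accuracy between two systems is to compute the residual of each returned eigenpair, i.e. how far A*v is from lambda*v. The following is a minimal sketch in plain Python on a tiny matrix; the function names are made up for illustration, and in practice you would run the same check on the output of each system's solver.

```python
import math


def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def relative_residual(A, eigenvalue, eigenvector):
    """Return ||A v - lambda v|| / (|lambda| ||v||), using 2-norms.

    abs() is used throughout so that complex eigenpairs work as well.
    Falls back to ||A v|| / ||v|| when the eigenvalue is exactly zero.
    """
    Av = matvec(A, eigenvector)
    diff = [av - eigenvalue * v for av, v in zip(Av, eigenvector)]
    norm_diff = math.sqrt(sum(abs(d) ** 2 for d in diff))
    norm_v = math.sqrt(sum(abs(v) ** 2 for v in eigenvector))
    scale = abs(eigenvalue) if eigenvalue != 0 else 1.0
    return norm_diff / (scale * norm_v)


A = [[2.0, 1.0], [1.0, 2.0]]
# (3, [1, 1]) is an exact eigenpair of A; (3, [1, 0]) is not.
print(relative_residual(A, 3.0, [1.0, 1.0]))
print(relative_residual(A, 3.0, [1.0, 0.0]))
```

If both systems return residuals of similar size, the timing comparison is at least comparing results of similar quality.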
Regarding :nobalance, in the documentation, [, something] is used to indicate that something is optional. You want to use @time (E,F)=eig(D, balance=:nobalance);. However, eig doesn't take keyword arguments, so the code and the documentation are not currently in sync.

Compiled Simulink/Matlab x Fortran - Performance

I have to prove to my client that Fortran is faster than Matlab/Simulink. He is considering migrating some code from Fortran to Matlab. The code is mainly logic and "procedural" subroutines. It does not use any native matrix operations or mathematical functions (eigenvalues, nonlinear equations, etc.).
I think that the question of who is faster is already answered considering several references over the internet and the "intrinsic characteristics" of each language, but I need concrete data.
All charts that I found compare Matlab/Simulink x Fortran but do not specify if the Matlab code is compiled or not (using matlab coder toolbox). I think that it is a critical issue.
I'm not saying that compiling the code will make Matlab faster than Fortran, but in order to really convince someone I would like to see the results.
A good start would be:
Performance - Matlab (.m) compiled (Matlab coder toolbox) X Intel Fortran
Performance - Simulink compiled (Realtime toolbox) X Intel Fortran
Has anyone already tested this scenario?
Matlab code that I recently "compiled" using the Matlab Coder produced a speed-up of x20 (!). The actual expected speedup depends on many things. If your Matlab code is highly vectorized and uses mainly linear-algebra routines, then the Coder is unlikely to produce much speedup. But if you have multiple loops and conditionals in your algorithm then you can indeed achieve order-of-magnitude speedup as in my example above.
Under the hood, Matlab's linear-algebra uses BLAS/LAPACK (via the MKL/ACML libraries), that use highly-optimized Fortran code. So unless you write extremely efficient Fortran, it is not likely that you will be able to outperform Matlab (despite the function-call overheads) for highly-vectorized Matlab linear-algebra/math algos. However, if your code uses conditionals/loops and similar non-math programming constructs, then the picture might change. In short, there's no simple answer - it depends on your specific algorithm/program.
Putting performance aside for the moment, Matlab has numerous other benefits over Fortran, including a vast array of tested built-in functions and enabling a rapid development cycle.
You would need to ask a more tightly defined question - there's no single answer to whether Fortran is faster than MATLAB/Simulink.
First of all, it's easy to write terrible, slow algorithms in either language. So you'd need to specify particular, well-written algorithms.
Secondly, there are many things for which MATLAB will be faster than even very well-written Fortran (or C). For example, if you want to multiply two big matrices together, or calculate some eigenvalues, or other linear algebra that is in MATLAB's sweet spot, you won't beat it. On the other hand if you're doing something with a lot more logic, that can't be vectorised, Fortran is likely to be faster (as long as it's written well).
When you introduce MATLAB Coder into the picture, these latter things are the ones that are most likely to benefit from a speedup by converting to C code (mostly because the former things really can't be sped up much, which is why you wouldn't beat them). But the speedup is variable - I've seen over 10-15x, but also sometimes only 1-2x.
You don't mention where you found the charts you have comparing MATLAB to Fortran, but if you've found them on the internet I would think it's a pretty safe assumption that they don't involve C code generation with MATLAB Coder, and represent the performance of just MATLAB.
Finally - one other method of speeding up MATLAB is to parallelize it with Parallel Computing Toolbox (which enables you to parallelize things over the cores on your local machine) and possibly also with Distributed Computing Server (parallelization on cluster). It's typically a lot easier to do this with MATLAB code than it is to speed up by using MATLAB Coder to produce C code - so if you think it's critical to consider MATLAB Coder in your comparisons, you should probably also consider this as well.
MATLAB Compiler will not make your code faster, it is intended for distributing your code to third party users that do not have MATLAB. You need to provide, along with your compiled code, the MCR or MATLAB Component Runtime, which is essentially a headless version of MATLAB, and which you can distribute freely if you have a license of MATLAB Compiler.
Now, if you use MATLAB Coder (or Simulink Coder for Simulink) to generate C code from your MATLAB code, then it is likely that you will get a speed up compared to interpreted MATLAB code. Even then, that depends on the code in question. Also, this only supports a subset of the MATLAB language, that is compatible with C code generation.

sparse matrices solver for f90

I am dealing with matrices of size up to N x N with N = 10^7; the number of nonzero elements is about 6 x N. (Those elements are grouped around the diagonal.) My machine has 16 GB of RAM, so I clearly need a sparse matrix solver. I run Ubuntu Linux and use Fortran 90 (gfortran), or, precisely speaking, ratfor90.
I have LAPACK, but it doesn't seem to support sparse matrix solving.
(Am I wrong about that?) MATLAB must be good, but I don't want to spend much time getting familiar with it; time is pressing. I have the old but gold SLATEC installed and use it for special functions; does it have sparse matrix routines?
I have heard about ARPACK, but can it be used as a plain solver? Could it be called from gfortran?
Any other suggestion?
Thanks, -- Alex
You are right. LAPACK is not applicable to this problem.
Direct sparse solvers are provided by the MUMPS, UMFPACK and SuperLU libraries.
PETSc is also a library collection where you can find a lot of information.
Ubuntu packages are available for all of these libraries.
ARPACK is a package that solves eigenvalue problems, but it is not a solver by itself.
I am not sure you can solve your problem in 16 GB. I recommend having a look at freefem++.
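The storage side of the question can be checked with simple arithmetic. The sketch below assumes 8-byte double-precision values and 4-byte integer indices in a CSR-style layout; those sizes and the function names are assumptions for illustration, not properties of any particular library.

```python
def dense_bytes(n, value_bytes=8):
    """Bytes needed to store a dense n x n matrix."""
    return n * n * value_bytes


def csr_bytes(n, nnz, value_bytes=8, index_bytes=4):
    """Bytes needed for CSR storage: values, column indices, row pointers."""
    return nnz * value_bytes + nnz * index_bytes + (n + 1) * index_bytes


n = 10**7        # matrix dimension from the question
nnz = 6 * n      # about six nonzeros per row

print(dense_bytes(n) / 1e12, "TB if stored densely")
print(csr_bytes(n, nnz) / 1e9, "GB if stored in CSR form")
```

Under these assumptions the dense matrix would need about 800 TB, while the sparse matrix itself takes under 1 GB. The number does not include the factors a direct solver has to build, which can be much larger than the matrix because of fill-in; how much larger depends on the structure and the ordering, and that is where the 16 GB limit may bite.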

fast matrix multiplication in Matlab

I need to perform a very large matrix/vector multiplication in Matlab: "A" is a 655360 by 5 real-valued matrix that is not necessarily sparse and "B" is a 655360 by 1 real-valued vector. My question is how to compute B'*A efficiently.
I have noticed a slight time improvement by computing A'*B instead, which gives a column vector. But it is still quite slow (I need to perform this operation several times in the program).
After a little searching I found an interesting Matlab toolbox, MTIMESX by James Tursa, which I hoped would improve the above matrix multiplication performance. After several trials, I can only get very marginal gains over the Matlab native matrix multiplication.
Any suggestions about how should I rewrite A'*B so that the operation is more efficient? Thanks.
Matlab's raison d'etre is doing matrix computations. I would be fairly surprised if you could significantly outperform its built-in matrix multiplication with hand-crafted tools. First of all, you should make sure your multiplication can actually be performed significantly faster. You could do this by implementing a similar multiplication in C++ with Eigen.
I have had good results with matlab matrix multiplication using the GPU
In order to avoid the transpose operation, you could try:
sum(bsxfun(@times, A, B), 1)
But I would be astonished if it were faster than the direct version. See @thiton's answer.
Also look at http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html to see why the column-vector-based version is faster than the row-vector-based version.
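To see why there is little room for improvement, note that B'*A is just one dot product per column of A. Here is a small sketch in plain Python, with A given as a list of columns (mirroring Matlab's column-major storage), followed by the operation counts for the sizes in the question. The function name is invented for illustration, and the byte count assumes 8-byte doubles.

```python
def row_times_matrix(b, columns):
    """Compute b' * A, where A is given as a list of its columns."""
    return [sum(bi * ai for bi, ai in zip(b, col)) for col in columns]


# Tiny example: A is 3 x 2, stored column by column.
b = [1.0, 2.0, 3.0]
A_columns = [[1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
print(row_times_matrix(b, A_columns))

# Operation counts for the sizes in the question.
rows, cols = 655360, 5
flops = 2 * rows * cols                  # one multiply and one add per entry
bytes_read = (rows * cols + rows) * 8    # A and B read once, 8-byte doubles
print(flops, "floating-point operations")
print(bytes_read, "bytes read")
print(flops / bytes_read, "operations per byte")
```

With roughly 0.2 operations per byte read, the run time is likely to be dominated by memory traffic and not by arithmetic, which would explain why a different multiplication routine brings only marginal gains.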
Matlab is built using fairly optimized libraries (BLAS, etc.), so you can't easily improve upon it from within Matlab. Where you can improve is to get a better BLAS, such as one optimized for your processor - this will enable better use of the caches by getting appropriately sized blocks of data from main memory. Take a look into creating your own compiled versions of ATLAS, ACML, MKL, and Goto BLAS.
I wouldn't try to solve this one particular multiplication unless it's really killing you. Changing up the BLAS is likely to lead to a happier solution, especially if you're not currently making use of multicore processors.
Your #1 option, if this is your bottleneck, is to re-examine your algorithm. See the question "Optimizing MATLAB code" for a great example of how choosing a different algorithm reduced runtime by three orders of magnitude.

Efficient EigenSolver Implementation

I am looking for an efficient eigensolver (language not important, although I would be programming in C#) that utilizes the multi-core features found in modern CPUs. Being able to work directly with the pardiso solver is a major plus. My matrices are mostly sparse, so an ideal solver should be able to take advantage of this fact to greatly improve memory usage and performance.
So far I have only found LAPACK and ARPACK. LAPACK, as implemented in Intel MKL, is a good candidate, as it offers multi-core optimization. But it seems that the LAPACK drivers don't work directly with the pardiso solver; furthermore, they don't seem to take advantage of sparse matrices (but I am not sure on this point).
ARPACK, on the other hand, seems to be pretty hard to set up in a Windows environment, and the parallel version, PARPACK, doesn't work so well. The bonus is that it can work with the pardiso solver.
The best would be Intel MKL + ARPACK with multi-core speedup. Are there any existing implementations that already do what I want?
I'm working on a problem with needs very similar to the ones you state. I'm considering FEAST:
http://www.ecs.umass.edu/~polizzi/feast/index.htm
I'm trying to make it work right now, but it seems perfect. I'm interested in hearing what your experience with it is, if you use it.
cheers
Ned
Have a look at the Eigen2 library.
I've implemented it already, in C#.
The idea is to first convert the matrix to CSR format. Then one can use MKL for the linear equation solving (using the pardiso solver) and for the matrix-vector operations.
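For readers unfamiliar with CSR, here is a minimal sketch of the conversion and of a matrix-vector product on the resulting three arrays. It is written in plain Python for illustration only and is not the C# implementation mentioned above; the function names are invented, and indices are zero-based here, whereas Fortran-style interfaces commonly expect one-based indices.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    Returns (values, col_idx, row_ptr), where the nonzeros of row i are
    values[row_ptr[i]:row_ptr[i + 1]] at columns col_idx[row_ptr[i]:row_ptr[i + 1]].
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


dense = [[4.0, 0.0, 1.0],
         [0.0, 3.0, 0.0],
         [1.0, 0.0, 2.0]]
values, col_idx, row_ptr = dense_to_csr(dense)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))
```

Only the nonzeros and their positions are stored, so both memory use and the cost of a matrix-vector product scale with the number of nonzeros instead of with the square of the dimension.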

Resources