Eigendecompositions are 5 times slower in Julia than in Mathematica? - performance

I am new to Julia and primarily work in Mathematica, so I probably have a few elementary mistakes floating around. I attempted to time how long Julia took to compute the eigensystem of a random matrix, and found it was 5-6 times slower than in Mathematica.
In Julia:
D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);
Out: elapsed time: 7.47950706 seconds (79638920 bytes allocated)
In Mathematica:
First@Timing@Eigensystem[RandomReal[{-500, 500}, {1000, 1000}]]
Out: 1.310408
For 2000 x 2000 arrays the picture is similar: Julia's slowdown going from 1000 x 1000 to 2000 x 2000 is slightly smaller than Mathematica's, but it is still slower overall; Julia takes 22 seconds, whereas Mathematica computes it in 8 seconds.
From what I've read about the Julia standard library for linear algebra, decompositions are implemented by calling LAPACK, which I thought was supposed to be very good, so I'm confused as to why the Julia code is running so much slower. Does anyone know why this is the case? Is it doing some kind of balancing or array symmetry detection that Mathematica doesn't do? Or is it actually slower?
Also, this is a syntax question and probably a silly error, but how do you change the balancing in Julia? I tried
@time (E,F)=eig(D[, balance=:nobalance]);
exactly as copied and pasted from the Julia manual, but it just gave a syntax error, so something's wrong.
I am using Windows 7 64-bit, with Julia version 0.2.0 64-bit, installed using the instructions at Steven Johnson's site, with Anaconda installed first to take care of prerequisites. I am using Mathematica student edition version 9.0.1.
EDIT 1:
Executing versioninfo() yielded
Julia Version 0.2.0
Commit 05c6461 (2013-11-16 23:44 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
So it looks like I'm using OpenBLAS for both LAPACK and BLAS. Once I get the Mathematica implementation info I will add that as well.
EDIT 2:
It appears that Windows Mathematica probably uses Intel MKL BLAS.

The eigenvalue calculation in Julia is outsourced to LAPACK and BLAS, and I think the same is true for Mathematica. Julia can be built against different BLAS and LAPACK implementations, so you are effectively comparing your chosen LAPACK and BLAS for Julia with Mathematica's LAPACK and BLAS (probably Intel MKL).
The default choice for Julia is OpenBLAS, which is fast on most architectures; on my machine Julia is faster than Mathematica for this calculation. If you are on Linux and have picked up BLAS and LAPACK from a repository, it is very likely that they are much slower than OpenBLAS.
The option for balancing was only recently added to Julia, and by mistake it was not added to eig, which is just a MATLAB-compatible interface to the eigfact function. Writing eigfact(A, balance=:nobalance) should work.
Edit 1:
Further investigation has shown that the difference is due to a threading problem in OpenBLAS on Windows. If Julia's BLAS is restricted to one thread the timings are comparable to Mathematica, but if more threads are allowed the calculation slows down. This doesn't seem to be a problem on Mac or Linux, but as mentioned above, in general the performance of OpenBLAS depends on the architecture.
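For reference, a minimal way to test this yourself is to pin the BLAS thread count and re-time the factorization. This is only a sketch, written against current Julia where the relevant functions live in LinearAlgebra (eigen, BLAS.set_num_threads); on the 0.2-era versions discussed here the equivalents were the exported blas_set_num_threads and eig.

using LinearAlgebra                      # on Julia 0.2 the equivalents live in Base

D = 1000 * (rand(1000, 1000) .- 0.5)

LinearAlgebra.BLAS.set_num_threads(1)    # restrict OpenBLAS to a single thread
@time eigen(D)                           # Julia 0.2 equivalent: blas_set_num_threads(1); @time eig(D)

LinearAlgebra.BLAS.set_num_threads(4)    # allow several threads again
@time eigen(D)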
Edit 2:
Recently, the balancing option has changed. Balancing can now be switched off by writing eigfact(A, permute=false, scale=false).
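For concreteness, here is a small sketch of the call with balancing switched off, written against current Julia where the factorization function is called eigen; on the older versions discussed above the same keywords are passed to eigfact.

using LinearAlgebra

A = 1000 * (rand(1000, 1000) .- 0.5)

F_balanced   = eigen(A)                               # default: permute and scale (i.e. balance)
F_unbalanced = eigen(A, permute=false, scale=false)   # balancing switched off

# On Julia 0.2/0.3 the equivalent call was eigfact(A, permute=false, scale=false).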

There's definitely at least one thing wrong, and of course the good news is that it's likely to be fixable. For this kind of thing, you're much better off filing an issue on GitHub. I've done that for you here.
Regarding the speed, you may want to check that the accuracy of the eigendecomposition is comparable. Accuracy can depend on the condition number of the matrix; it's possible that Julia is using a more careful algorithm, and your example may not reveal that if its condition number is benign. This is definitely something to discuss over at the issue.
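One cheap sanity check on accuracy is the relative residual of the computed eigenpairs; it doesn't capture conditioning, but it is easy to compare across systems. A sketch in current Julia syntax (eigen replaces the old eig):

using LinearAlgebra

A = 1000 * (rand(1000, 1000) .- 0.5)
F = eigen(A)                             # on Julia 0.2: (E, V) = eig(A)

# A*V should equal V*Diagonal(values); the size of the mismatch, relative to
# norm(A), is a rough measure of the quality of the decomposition.
rel_residual = norm(A * F.vectors - F.vectors * Diagonal(F.values)) / norm(A)
println("relative residual = ", rel_residual)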
Regarding :nobalance, in the documentation [, something] is used to indicate that something is an optional argument, so you would write @time (E,F)=eig(D, balance=:nobalance);. However, eig doesn't currently take keyword arguments, so the code and the documentation are not in sync.

Related

tensorflow pow function is slow

I am using TensorFlow to accelerate a stiff chemistry solver. In the process, I often have to calculate tf.pow(a,b), where a is a tensor and b is a constant. During profiling, I found that tf.pow was quite slow, even slower than tf.exp. Surprised by that, I computed the power as tf.exp(tf.log(a)*b) and timed it; the exp/log version was twice as fast as tf.pow. Why is that? It was quite unexpected.
I should mention that my tensors use single-precision floats, and that I'm running on Windows with Python 3.6 and TensorFlow 1.5 on a CPU, installed from the pip wheel under conda.
I believe Tensorflow's exp and pow operations are calling Eigen's implementations. It appears that Eigen is using SIMD instructions for exp but not for pow: https://eigen.tuxfamily.org/dox/group__CoeffwiseMathFunctions.html
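The rewrite the asker used relies on the identity a^b = exp(b*log(a)) for positive a. A small illustrative check of that identity, written in Julia only because that is the language of the main question above (whether the two forms differ in speed depends entirely on how the underlying library vectorizes pow versus exp/log, as the Eigen link suggests):

a = rand(Float32, 1_000_000) .+ 1f-6     # strictly positive, since log is needed
b = 2.7f0

p_pow = a .^ b
p_exp = exp.(b .* log.(a))

# The two forms agree to single precision; any timing gap comes from the library,
# not from the math.
@assert maximum(abs.(p_pow .- p_exp)) < 1f-5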

Compiled Simulink/Matlab x Fortran - Performance

I have to prove to my client that Fortran is faster than Matlab/Simulink. He is considering migrating a code base from Fortran to Matlab. The code is mainly logic and "procedural" subroutines; it does not use any native matrix operations or mathematical functions (eigenvalues, nonlinear equations, etc.).
I think that the question of who is faster is already answered considering several references over the internet and the "intrinsic characteristics" of each language, but I need concrete data.
All the charts I have found compare Matlab/Simulink against Fortran but do not specify whether the Matlab code is compiled or not (using the Matlab Coder toolbox), which I think is a critical issue.
I'm not saying that compiling the code will make Matlab faster than Fortran, but in order to really convince someone I would like to see the results.
A good start would be:
Performance - Matlab (.m) compiled (Matlab coder toolbox) X Intel Fortran
Performance - Simulink compiled (Realtime toolbox) X Intel Fortran
Does anyone have already tested this scenario?
Matlab code that I recently "compiled" using the Matlab Coder produced a speed-up of x20 (!). The actual expected speedup depends on many things. If your Matlab code is highly vectorized and uses mainly linear-algebra routines, then the Coder is unlikely to produce much speedup. But if you have multiple loops and conditionals in your algorithm then you can indeed achieve order-of-magnitude speedup as in my example above.
Under the hood, Matlab's linear algebra uses BLAS/LAPACK (via the MKL/ACML libraries), which are highly optimized Fortran code. So unless you write extremely efficient Fortran, it is not likely that you will be able to outperform Matlab (despite the function-call overheads) for highly vectorized linear-algebra/math algorithms. However, if your code uses conditionals/loops and similar non-math programming constructs, then the picture might change. In short, there's no simple answer - it depends on your specific algorithm/program.
Putting performance aside for the moment, Matlab has numerous other benefits over Fortran, including a vast array of tested built-in functions and enabling a rapid development cycle.
You would need to ask a more tightly defined question - there's no single answer to whether Fortran is faster than MATLAB/Simulink.
First of all, it's easy to write terrible, slow algorithms in either language. So you'd need to specify particular, well-written algorithms.
Secondly, there are many things for which MATLAB will be faster than even very well-written Fortran (or C). For example, if you want to multiply two big matrices together, or calculate some eigenvalues, or other linear algebra that is in MATLAB's sweet spot, you won't beat it. On the other hand if you're doing something with a lot more logic, that can't be vectorised, Fortran is likely to be faster (as long as it's written well).
When you introduce MATLAB Coder into the picture, these latter things are the ones that are most likely to benefit from a speedup by converting to C code (mostly because the former things really can't be sped up much, which is why you wouldn't beat them). But the speedup is variable - I've seen over 10-15x, but also sometimes only 1-2x.
You don't mention where you found the charts you have comparing MATLAB to Fortran, but if you've found them on the internet I would think it's a pretty safe assumption that they don't involve C code generation with MATLAB Coder, and represent the performance of just MATLAB.
Finally - one other method of speeding up MATLAB is to parallelize it with Parallel Computing Toolbox (which enables you to parallelize things over the cores on your local machine) and possibly also with Distributed Computing Server (parallelization on cluster). It's typically a lot easier to do this with MATLAB code than it is to speed up by using MATLAB Coder to produce C code - so if you think it's critical to consider MATLAB Coder in your comparisons, you should probably also consider this as well.
MATLAB Compiler will not make your code faster, it is intended for distributing your code to third party users that do not have MATLAB. You need to provide, along with your compiled code, the MCR or MATLAB Component Runtime, which is essentially a headless version of MATLAB, and which you can distribute freely if you have a license of MATLAB Compiler.
Now, if you use MATLAB Coder (or Simulink Coder for Simulink) to generate C code from your MATLAB code, then it is likely that you will get a speed up compared to interpreted MATLAB code. Even then, that depends on the code in question. Also, this only supports a subset of the MATLAB language, that is compatible with C code generation.

sparse matrices solver for f90

I am dealing with matrices of size up to N=10^7 x N=10^7; the number of nonzero elements is about 6 x N (they are clustered around the diagonal). My machine has 16 GB of RAM, so I clearly need a sparse matrix solver. I run Ubuntu Linux and use Fortran 90 (gfortran), or more precisely ratfor90.
I have LAPACK, but it doesn't seem to support sparse matrix solving
(am I wrong about that?). MATLAB must be good, but I don't want to spend much time getting familiar with it; time is pressing. I have the old/gold SLATEC installed and use it for special functions; does it have sparse matrix routines?
I hear about ARPACK, but can it be used as a plain solver? could it be called from gfortran?
Any other suggestion?
Thanks, -- Alex
You are right: LAPACK is not applicable to this problem.
Direct sparse solvers are provided by the MUMPS, UMFPACK, and SuperLU libraries.
PETSc is also a library collection where you can find a lot of relevant functionality and information.
Ubuntu packages are available for all of these libraries.
ARPACK is a package for solving eigenvalue problems; it is not a linear-system solver by itself.
I am not sure you can solve your problem in 16 GB. I recommend having a look at FreeFem++.

Scientific library in C/C++ with OpenMP

I have to exploit OpenMP in an algorithm, and for this purpose I need some mathematical functions, like eig or svd, which are available in MATLAB and quite fast there. I have already tried the following libraries with OpenMP:
GSL - GNU Scientific Library
Eigen C++ template library
but I don't know why my OpenMP-parallelised code is much slower than the serial code. Maybe there is something wrong in the library, or maybe the random, eig, or svd functions are blocking? I have no idea how to figure this out. Can somebody suggest which math library is most compatible with OpenMP?
I can recommend Intel's MKL; note that it costs money, which may affect your decision. I neither know nor care what language(s) it is written in, just so long as it provides APIs callable from my chosen language. Mine is Fortran, but it has bindings for C too.
If you look around SO you'll find many questions from people whose first (or second or third) OpenMP programs were actually slower than their serial versions. Look at some of the answers. Don't conclude that there is a magic bullet, in the shape of a library, to make your code faster. Instead, realise that it is most likely that you've written a poorly-parallelised program and fix that.
Finally, if you have an installation of Matlab, don't expect to be able to write your own routines to outperform Matlab's. I won't say it can't be done, but I think you'll find it very difficult.
GSL is compatible with OpenMP. You can also try the Intel Math Kernel Library, which is available as a free trial.
If the speed-up is not significant, then the code is probably not very parallelizable. You may want to inspect the running threads with Intel Thread Checker, which could help you see where the bottlenecks are.
I think you just want to find a fast implementation of lapack (or related routines) which is already threaded, but it's a little hard to tell from your question. High Performance Mark suggests MKL, which is an excellent example; others include ATLAS or FLAME which are open source but take some doing to build.

Efficient EigenSolver Implementation

I am looking for an efficient eigensolver (language not important, although I would be programming in C#) that utilizes the multi-core features found in modern CPUs. Being able to work directly with the PARDISO solver is a major plus. My matrices are mostly sparse, so an ideal solver should take advantage of this fact to greatly reduce memory usage and improve performance.
So far I have only found LAPACK and ARPACK. LAPACK, as implemented in Intel MKL, is a good candidate, as it offers multi-core optimization. But it seems that the drivers inside LAPACK don't work directly with the PARDISO solver; furthermore, they don't seem to take advantage of sparse matrices (but I am not sure on this point).
ARPACK, on the other hand, seems to be pretty hard to set up in a Windows environment, and the parallel version, PARPACK, doesn't work so well. The bonus is that it can work with the PARDISO solver.
The best would be Intel MKL + ARPACK with multi-core speedup. I'm not sure whether there are any existing implementations that already do what I want to do?
I'm working on a problem with needs very similar to the ones you state. I'm considering FEAST:
http://www.ecs.umass.edu/~polizzi/feast/index.htm
I'm trying to make it work right now, but it seems perfect. I'm interested in hearing what your experience with it is, if you use it.
cheers
Ned
Have a look at the Eigen2 library.
I've already implemented this in C#.
The idea is to convert the matrix into CSR format. Then one can use MKL for the linear equation solving (using the PARDISO solver) and for the matrix-vector operations.
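To make the "convert to CSR" step concrete, here is a small sketch of what the three CSR arrays look like. It is written in Julia (the language of the main question above) rather than C#, and the function name dense_to_csr is purely illustrative; MKL's PARDISO consumes exactly these three arrays (values, column indices, row pointers), usually 1-based when called through the Fortran interface.

# Build the three CSR arrays from a dense matrix. A real PARDISO/MKL call would
# take these arrays plus the matrix size and a right-hand side.
function dense_to_csr(A::AbstractMatrix{T}) where {T}
    n, m = size(A)
    vals   = T[]
    colind = Int[]
    rowptr = Vector{Int}(undef, n + 1)
    rowptr[1] = 1
    for i in 1:n
        for j in 1:m
            if A[i, j] != 0
                push!(vals, A[i, j])
                push!(colind, j)
            end
        end
        rowptr[i + 1] = length(vals) + 1
    end
    return vals, colind, rowptr
end

A = [4.0 0.0 1.0; 0.0 3.0 0.0; 2.0 0.0 5.0]
vals, colind, rowptr = dense_to_csr(A)
# vals   == [4.0, 1.0, 3.0, 2.0, 5.0]
# colind == [1, 3, 2, 1, 3]
# rowptr == [1, 3, 4, 6]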
