tensorflow pow function is slow - performance

I am using TensorFlow to accelerate a stiff chemistry solver. In the process, I often have to calculate tf.pow(a,b), where a is a tensor and b is a constant. During profiling, I found that tf.pow was quite slow, even slower than tf.exp. That surprised me, so I computed the power as tf.exp(tf.log(a)*b) instead and timed it. The exp/log version was twice as fast as tf.pow. Why is that? It was quite unexpected.
I should mention that my tensors are single-precision floats, and I'm running on Windows with Python 3.6 and TensorFlow 1.5 on a CPU, installed from the pip wheel under conda.
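For reference, the comparison was along these lines (a minimal sketch, not my actual solver code; the tensor size, exponent, and iteration count are illustrative, and it assumes the TF 1.x graph API):
import time
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(1000000).astype(np.float32) + 0.1)
b = 2.7  # constant exponent

pow_op = tf.pow(a, b)
explog_op = tf.exp(tf.log(a) * b)  # equivalent for positive a

with tf.Session() as sess:
    for name, op in [("tf.pow", pow_op), ("exp(log(a)*b)", explog_op)]:
        sess.run(op)  # warm-up run
        t0 = time.time()
        for _ in range(100):
            sess.run(op)
        print(name, (time.time() - t0) / 100)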

I believe Tensorflow's exp and pow operations are calling Eigen's implementations. It appears that Eigen is using SIMD instructions for exp but not for pow: https://eigen.tuxfamily.org/dox/group__CoeffwiseMathFunctions.html

Related

Eigen + MKL uses single core for complex matrix (ZHEEV)

There is some weird behavior from MKL on our cluster. I am calling Eigen::SelfAdjointEigenSolver<Eigen::MatrixXcd> for a complex matrix (ZHEEV).
When I calculate the eigenvectors for large matrices (dim >~ 100k) it only uses a single core.
Strangely, it runs perfectly fine (on multiple cores) for smaller complex matrices, for real matrices, and for large complex matrices (dim >~ 100k) when eigenvectors are not requested.
Did anyone face the same issue or has any idea what is going on in the background?
I tried various mkl versions.
This issue is well known with OpenBLAS and Netlib's reference LAPACK due to poor zlasr performance. If you can, switch to zheevr, or use Intel's MKL, which seems to take another route that avoids zlasr; MKL definitely doesn't exhibit this issue. (You didn't mention which versions of MKL you've tried.)
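(The question is about Eigen/MKL from C++, but if you want a quick sanity check of the zheevr route without recompiling anything, SciPy >= 1.5 lets you pick the LAPACK driver explicitly. The matrix size below is illustrative, far smaller than the 100k case.)
import time
import numpy as np
from scipy.linalg import eigh

n = 4000
a = np.random.rand(n, n) + 1j * np.random.rand(n, n)
a = a + a.conj().T  # Hermitian, so eigh dispatches to the zheev* routines

for driver in ("ev", "evr"):  # "ev" -> zheev, "evr" -> zheevr for complex input
    t0 = time.time()
    eigh(a, driver=driver)
    print(driver, time.time() - t0)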

In non-linear solvers, what influences solver time vs NLP function evaluations?

I have some difficulty understanding how performance in non-linear optimisation is influenced by the specific way the solver engine is interfaced.
We have an optimisation model that, in its first version, was written in GAMS.
IPOPT (a common FOSS non-linear solver engine) reported, for each optimisation, 1.4 CPU seconds spent in IPOPT itself (excluding function evaluations) and 0.2 CPU seconds in function evaluation.
When we converted the model to C++ (for better accounting of the non-optimisation components of the model) and interfaced IPOPT through its C++ API (using ADOL-C and ColPack for AD), we got execution times of 0.7 secs in IPOPT and 9.4 secs in function evaluation (the improvement in IPOPT is likely because, compiling IPOPT from source, we were able to use better linear solvers that are not available in the GAMS version of IPOPT).
So using C++, admittedly with badly optimised code, gave us function evaluations ~50 times slower than GAMS, partially compensated by better solver time.
We are now evaluating the feasibility to convert the model in other languages, either Python with Pyomo, or Julia with JuMP.
But we would first like to understand how the function evaluation performed by the solver at each step depends on the specific language of the implementation.
With C++, it's pretty evident that the functions making up the optimisation model are directly executed (evaluated) at each iteration, so the way they are implemented does matter (in particular, the gradient and Hessian are recomputed each time, at least in our implementation).
How does this work with Pyomo and JuMP? Would each iteration be evaluated in Python or Julia, or would Pyomo and JuMP instead first render the model in (I guess) C, derive (rather than evaluate) the gradient and Hessian once and for all, and then have that "C version" evaluated at each iteration?
It would clearly make a big difference, especially for Python.
Pyomo interfaces to Ipopt by converting the model to the NL file format. It assumes the "ipopt" executable is in your PATH (Ipopt compiled with ASL). All function evaluations that take place during optimization happen in C within the Ampl Solver Library.
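A minimal sketch of that workflow, assuming Pyomo is installed and an ASL-linked ipopt executable is on the PATH (the model below is a placeholder, not the one from the question):
from pyomo.environ import ConcreteModel, Var, Objective, SolverFactory

m = ConcreteModel()
m.x = Var(initialize=0.0)
m.y = Var(initialize=0.0)
# Rosenbrock objective as a stand-in for the real model
m.obj = Objective(expr=(1 - m.x) ** 2 + 100 * (m.y - m.x ** 2) ** 2)

# Pyomo writes an NL file, calls the ipopt executable, and loads the
# solution back; all function/derivative evaluations happen inside the ASL.
SolverFactory("ipopt").solve(m, tee=True)
print(m.x(), m.y())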
JuMP has compared favorably with GAMS in our own benchmarks; take that as you may. The derivative computations are entirely in Julia (which is fast), no compiled C code.

Eigendecompositions are 5 times slower in Julia than in Mathematica?

I am new to Julia and primarily work in Mathematica, so I probably have a few elementary mistakes floating around. I attempted to time how long Julia took to compute the eigensystem of a random matrix, and found it was 5-6 times slower than in Mathematica.
In Julia:
D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);
Out: elapsed time: 7.47950706 seconds (79638920 bytes allocated)
In Mathematica:
First@Timing@Eigensystem[RandomReal[{-500, 500}, {1000, 1000}]]
Out: 1.310408
For 2000 x 2000 arrays the picture is similar: Julia's slowdown going to the larger size is slightly smaller than Mathematica's, but it is still slower overall, taking 22 seconds versus Mathematica's 8 seconds.
As far as I read in the Julia standard library for linear algebra, decompositions are implemented by calling LAPACK, which I thought was supposed to be very good, so I'm confused as to why the Julia code is running so much slower. Does anyone know why this is the case? Is it doing some kind of balancing or array-symmetry-detection that Mathematica doesn't do? Or is it actually slower?
Also, this is a syntax question and probably a silly error, but how do you change the balancing in Julia? I tried
@time (E,F)=eig(D[, balance=:nobalance]);
exactly as copied and pasted from the Julia manual, but it just gave a syntax error, so something's wrong.
I am using Windows 7 64-bit, with Julia version 0.2.0 64-bit, installed using the instructions at Steven Johnson's site, with Anaconda installed first to take care of prerequisites. I am using Mathematica student edition version 9.0.1.
EDIT 1:
Executing versioninfo() yielded
Julia Version 0.2.0
Commit 05c6461 (2013-11-16 23:44 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
So it looks like I'm using OpenBLAS for both LAPACK and BLAS. Once I get the Mathematica implementation info I will add that as well.
EDIT 2:
It appears that Windows Mathematica probably uses Intel MKL BLAS.
The eigen calculation in Julia is outsourced to LAPACK and BLAS, and I think it is also the case for Mathematica. Julia can use different versions of BLAS and LAPACK and you are therefore effectively comparing your choice of LAPACK and BLAS for Julia with Mathematica's LAPACK and BLAS (probably Intel MKL).
The default choice for Julia is OpenBLAS which is fast on most architectures and on my machine Julia is faster than Mathematica for the eigen calculation. If you are on Linux and have chosen BLAS and LAPACK from a repo, it is very likely that they are much slower than OpenBLAS.
The option for balancing was recently added to Julia, but by mistake it was not added to the function eig, which is only a MATLAB-compatible interface to the eigfact function. Writing eigfact(A,balance=:nobalance) should work.
Edit 1:
Further investigation has shown that the difference is due to a threading problem in OpenBLAS on Windows. If Julia's BLAS is restricted to one thread the timings are comparable to Mathematica, but if more threads are allowed the calculation slows down. This doesn't seem to be a problem on Mac or Linux, but as mentioned above, in general the performance of OpenBLAS depends on the architecture.
Edit 2:
Recently, the balancing option has changed. Balancing can be switched off by writing eigfact(A,permute=false,scale=false).
There's definitely at least one thing wrong, and of course the good news is that it's likely to be fixable. For this kind of thing, you're much better off filing an issue on GitHub. I've done that for you here.
Regarding the speed, you may want to check that the accuracy of the eigendecomposition is comparable. And of course, accuracy sometimes depends on the condition number of the matrix; it's possible that Julia is using a more careful algorithm, and your example may or may not reveal that if its condition number is not a problem. This is definitely something to discuss over at the issue.
Regarding :nobalance, in the documentation, [, something] is used to indicate that something is optional. You want to use @time (E,F)=eig(D, balance=:nobalance);. However, eig doesn't take keyword arguments, so the code and the documentation are not currently in sync.

sparse matrices solver for f90

I am dealing with matrices of size up to N=10^7 x N=10^7; the number of nonzero elements is about 6 x N. (Those elements are clustered around the diagonal.) My RAM is 16 GB, so I clearly need a sparse matrix solver. I run Ubuntu Linux and use Fortran 90 (gfortran), or, precisely speaking, ratfor90.
I have LAPACK, but it doesn't seem to support sparse matrix solving (am I wrong about that?).
MATLAB must be good, but I don't want to spend much time getting familiar with it; time is pressing. I have the old/gold SLATEC installed and use it for special functions; does it have sparse matrix routines?
I have heard about ARPACK, but can it be used as a plain solver? Could it be called from gfortran?
Any other suggestion?
Thanks, -- Alex
You are right, LAPACK is not applicable to this problem.
Direct sparse solvers are provided by the MUMPS, UMFPACK, and SuperLU libraries.
PETSc is also a library collection where you can find a lot of information.
Ubuntu packages are available for all of these libraries.
ARPACK is a package for solving eigenvalue problems; it is not a linear-system solver by itself.
I am not sure you can solve your problem in 16 GB. I recommend having a look at FreeFem++.
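If it helps to prototype the banded structure before wiring up the Fortran bindings, SuperLU and UMFPACK are also reachable through SciPy. A small sketch, with N far below the 10^7 target (whether the full problem fits in 16 GB depends on fill-in, as noted above):
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

N = 100000
offsets = [-2, -1, 0, 1, 2, 3]  # ~6 nonzeros per row, clustered around the diagonal
diagonals = [np.random.rand(N - abs(k)) + (6.0 if k == 0 else 0.0) for k in offsets]
A = sp.diags(diagonals, offsets, format="csc")
b = np.random.rand(N)

# spsolve uses SuperLU by default; with scikit-umfpack installed it can
# use UMFPACK instead.
x = spsolve(A, b)
print(np.linalg.norm(A @ x - b))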

Integrating ODEs on the GPU using boost and python

I posted here not too long ago about a model I am trying to build using PyCUDA, which solves about 9000 coupled ODEs. My model is too slow, however, and an SO member suggested that memory transfers from host to GPU are probably the culprit.
Right now CUDA is being used only to calculate the rate of change of each of the 9000 species I am dealing with. Since I am passing an array from the host to the GPU to perform this calculation and then returning an array from the GPU to integrate on the host, I can see how this would slow things down.
Would Boost be the solution to my problem? From what I have read, Boost allows interoperability between C++ and Python. It also includes odeint for C++, which, as I read, can be paired with Thrust to do the reduction and integration entirely on the GPU. Is my understanding correct?
Thank you,
Karsten
Yes, boost.odeint and boost.python should solve your problem. You can use odeint with Thrust. There are also some OpenCL libraries (VexCL, ViennaCL) which might be easier to use than Thrust. Have a look at this paper for a comparison and for use cases of odeint on GPUs.
Boost.python can handle the communication between the C++ application and Python. Another approach would be a very slim command-line application for solving the ODE (using boost.odeint) that is entirely controlled by your Python application.
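A sketch of the second approach, driving a slim command-line solver from Python; the executable name, its flags, and the file formats below are purely hypothetical placeholders:
import subprocess
import numpy as np

# Hypothetical stand-alone solver built around boost.odeint + Thrust.
np.savetxt("state0.txt", np.random.rand(9000))  # initial concentrations
subprocess.run(
    ["./gpu_ode_solver", "--input", "state0.txt",
     "--t-end", "10.0", "--output", "trajectory.txt"],
    check=True,
)
trajectory = np.loadtxt("trajectory.txt")
print(trajectory.shape)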
