In non-linear solvers, what influences solver time vs NLP function evaluations?

I'm having some difficulty understanding how performance in non-linear optimisation is influenced by the specific way the solver engine is interfaced.
We have an optimisation model that, in its first version, was written in GAMS.
IPOPT (a common FOSS non-linear solver engine) reported, for each optimisation, an execution time of 1.4 CPU seconds inside IPOPT itself (without function evaluations) and 0.2 CPU seconds in function evaluations.
When we converted the model to C++ (for better accounting of the non-optimisation components of the model) and interfaced IPOPT through its C++ API (using ADOL-C and ColPack for automatic differentiation), we got execution times of 0.7 seconds in IPOPT and 9.4 seconds in function evaluations. The improvement in IPOPT is likely due to the fact that, compiling IPOPT from source, we were able to use better linear solvers that are not available in the GAMS version of IPOPT.
So using C++, admittedly with badly optimised code, gave us function evaluations ~50 times slower than GAMS, partially compensated by the better solver time.
We are now evaluating the feasibility of converting the model to other languages, either Python with Pyomo or Julia with JuMP.
But first we would like to understand how the function evaluations made by the solver at each step depend on the language in which the model is implemented.
With C++, it is pretty evident that the functions making up the optimisation model are directly executed (evaluated) at each iteration, so the way they are implemented matters (and in particular, the gradient and Hessian are recomputed each time, at least in our implementation).
How is it with Pyomo and JuMP? Would each iteration be evaluated in Python and Julia, or would Pyomo and JuMP instead first render the model in (I guess) C, derive the gradient and Hessian once and for all, and then evaluate this "C version" each time?
It would clearly make a big difference, especially for Python.

Pyomo interfaces to Ipopt by converting the model to the NL file format. It assumes the "ipopt" executable is in your PATH (Ipopt compiled with ASL). All function evaluations that take place during optimization happen in C within the AMPL Solver Library (ASL).
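Incidentally, the same NL-file route is available from Julia too: the AmplNLWriter.jl package writes the model to an NL file and shells out to an ASL-linked solver binary, so function evaluations again happen in ASL's C code rather than in the modelling language. A minimal sketch, assuming a recent JuMP/AmplNLWriter setup and an "ipopt" executable on the PATH:

using JuMP, AmplNLWriter

# Each solve writes a temporary .nl file, invokes the external ipopt process,
# and reads the .sol file back; the AMPL Solver Library (C code) performs
# all function and derivative evaluations.
model = Model(() -> AmplNLWriter.Optimizer("ipopt"))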

JuMP has compared favorably with GAMS in our own benchmarks; take that as you may. The derivative computations are entirely in Julia (which is fast); there is no compiled C code.
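To make that concrete, here is a minimal nonlinear model solved with Ipopt through JuMP (a sketch using the current JuMP API; the toy Rosenbrock objective just stands in for a real model):

using JuMP, Ipopt

model = Model(Ipopt.Optimizer)
@variable(model, x)
@variable(model, y)
# JuMP records the expression graph of the objective and differentiates it
# with sparse reverse-mode AD implemented in pure Julia; the gradient and
# Hessian callbacks Ipopt calls at each iteration are compiled Julia code,
# with no C code generation step.
@NLobjective(model, Min, (1 - x)^2 + 100 * (y - x^2)^2)
optimize!(model)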

Related

Julia - Parallel mathematical optimizers

I'm using GLPK through Julia, and I need to optimize the same GLPK.Prob repeatedly.
What changes between each optimization is that some combination of variables gets fixed to 0.
Put simply, in pseudocode:
lp = GLPK.Prob()
variables_to_block = [[1,2,3], [2,3], [6,7], [19,100,111]...]
for i in variables_to_block
    block_vars(lp, i)    # fix this combination of variables to 0
    simplex(lp)
    restore_vars(lp, i)  # undo the fixing
end
When I then run this, it looks like CPU1 acts as a scheduler, staying in the 9-11% range, and the load on CPU3 and CPU4 alternates between 0% and 100%, though never at the same time; the load on CPU2 stays at 0%.
This can take a while and I'd like to use all cores.
It is, however, a bit of a hassle to use Julia's parallel features, particularly for LP models, because they involve pointers, so (as far as I know) they cannot easily be copied between cores.
Is there a way to set the GLPK solver binary (or something) to automatically try to fully utilize all cores? Either by compiling GLPK in such a way, or in any other way?
As far as I know, GLPK is not multithreaded. If you must have a multithreaded solver, then consider using a newer one such as Gurobi or MOSEK. They have free academic licenses.
If commercial solvers are anathema to your purposes, then perhaps try the Splitting Conic Solver (SCS). It is released for free under the MIT License.
Otherwise, the suggestions in the comments (e.g. batch processing) may be your only recourse.
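Since the individual solves are independent, one workaround is to give each worker process its own copy of the problem, so no GLPK pointers are ever shared. A sketch with Distributed's pmap on current Julia, where build_lp, block_vars and extract_solution are hypothetical helpers standing in for your own model-construction code:

using Distributed
addprocs(3)  # one worker per spare core

@everywhere using GLPK
@everywhere function solve_blocked(vars)
    lp = build_lp()       # hypothetical: each call builds a fresh GLPK.Prob,
    block_vars(lp, vars)  # so no pointers are shared between processes
    GLPK.simplex(lp)
    return extract_solution(lp)  # hypothetical accessor for what you need back
end

variables_to_block = [[1,2,3], [2,3], [6,7], [19,100,111]]
results = pmap(solve_blocked, variables_to_block)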

ode45 mex file runs slow

This is the function trial2 which creates the system of differential equations:
function xdot=trial2(t,x)
delta=0.1045;epsilon=0.0048685;
xdot=[x(2);(-delta-epsilon*cos(t))*x(1)-0.7*delta*abs(x(1))];
Then, I try and solve this using ode45:
[t,x]=ode45('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this takes approximately 15 minutes, so I tried to improve the run time by creating a mex file. I found an ode45 mex-equivalent function, ode45eq.c. Then I tried to solve by:
mex ode45eq.c;
[t,x]=ode45eq('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this seems to run even slower. What could be the reason, and how can I improve the run time to 2–3 minutes? I have to perform a lot of these computations. Also, is it worth creating a standalone C++ program to solve the problem faster?
Also, I am running this on an Intel i5 processor, 64-bit system with 8 GB RAM. How much gain in speed do you think I can get if I move to a better processor with, say, 16 GB RAM?
With current versions of Matlab, it's very unlikely that you'll see any performance improvement by compiling ode45 to C/C++ mex. In fact, the compiled version will almost certainly be slower, as you found. There's a good reason that ode45 is written in pure Matlab, as opposed to being compiled down to a native C function: it has to call user functions written in Matlab on every iteration. Additionally, Matlab's ode45 is a very dynamic function that is capable of interacting with the Matlab environment in many ways during the course of integration (plotting output functions, event detection, interpolation, etc.). It's also probably more straightforward to safely handle dynamic memory allocation in Matlab than in C.
Your C code calls your user function via mexCallMATLAB. This function is not really meant for repeated calls, especially if they transfer data back and forth. Doing what you're trying to do would likely require new mex APIs and possibly changes to the Matlab language.
If you want faster numerical integration, you're going to have to give up the convenience of writing your integration functions (i.e., trial2 in your example) in Matlab. You'll need to "hard code" your integration functions and compile them along with the integration scheme itself. With detailed knowledge about your problem and decent programming skills, you can write a tight integration loop, and it may be possible to achieve an order-of-magnitude speedup in some cases.
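For illustration of what "hard coding" looks like, here is a sketch of a fixed-step classical RK4 loop with the trial2 right-hand side inlined (written in Julia for brevity; a C version has the same shape, and note that a fixed step deliberately gives up ode45's adaptive stepping and error control):

# Right-hand side of trial2, inlined as plain scalar arithmetic.
function trial2_rhs(t, x1, x2)
    delta = 0.1045; epsilon = 0.0048685
    return (x2, (-delta - epsilon*cos(t))*x1 - 0.7*delta*abs(x1))
end

# Fixed-step classical Runge-Kutta 4 over [t0, t1] with step h.
function rk4_integrate(t0, t1, x1, x2, h)
    t = t0
    while t < t1
        k1a, k1b = trial2_rhs(t,       x1,           x2)
        k2a, k2b = trial2_rhs(t + h/2, x1 + h/2*k1a, x2 + h/2*k1b)
        k3a, k3b = trial2_rhs(t + h/2, x1 + h/2*k2a, x2 + h/2*k2b)
        k4a, k4b = trial2_rhs(t + h,   x1 + h*k3a,   x2 + h*k3b)
        x1 += h/6 * (k1a + 2k2a + 2k3a + k4a)
        x2 += h/6 * (k1b + 2k2b + 2k3b + k4b)
        t += h
    end
    return x1, x2
end

# Usage, matching the original initial condition [0; 1]:
# x1, x2 = rk4_integrate(0.0, 1.0e7, 0.0, 1.0, 1e-2)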
Lastly, your trial2 function has an absolute value as well as an oscillating trigonometric function in it. Is this differential equation stiff? Have you tried other solvers, e.g., ode15s? Compare the outputs even over a shorter time period. And you may find that you get a bit of a speed-up (~25% on my machine) if you use the modern way of passing function handles instead of strings to ode45:
[t,x] = ode45(@trial2,[0 10000000],[0;1]);
The trial2 function can still be in a separate M-file, or it can be a sub-function in the same file as your call to ode45 (this file needs to be a function file, not a script, of course).
What you did is basically replace today's implementation with the implementation from 1993; you have shown that MathWorks did a great job improving the performance of the ode45 solver.
I don't see any possibility to improve the performance in this case. You can assume that such a fundamental part of MATLAB as ode45 is implemented in an optimal way; replacing it with some other code just to mex it is not the solution. Everything you could gain using a mex function is cutting off the overhead of input/output handling, which is implemented in M code. That is probably less than 0.1 s of execution time.

Extremely fast method for modular exponentiation with modulus and exponent of several million digits

As a hobby project I'm taking a crack at finding really large prime numbers. The primality tests for this contain modular exponentiation calculations, i.e. a^e mod n. Let's call this the modpow operation to keep the explanation simple. I want to speed up this particular calculation.
Currently I am using GMP's mpz_powm function, but it is kind of slow. The reason I think it's too slow is that a call to GMP's modpow is slower than a full-blown primality test by the software PFGW on the same large number. (To be clear, this is just GMP's modpow part, not my whole custom primality testing routine I am comparing.) PFGW is considered the fastest in its field, and for my use case it uses a Brillhart-Lehmer-Selfridge primality test - which also uses the modpow procedure - so it's not because of mathematical cleverness that PFGW is faster in that respect (please correct me if I'm wrong here). It looks like the bottleneck in GMP is the modpow operation. An example runtime for numbers with a little over 20,000 digits: GMP's modpow operation takes about 45 seconds, while PFGW finishes the whole primality test (involving a modpow) in 9 seconds flat. The difference gets even more impressive with bigger numbers. GMP uses FFT multiplication and Montgomery reduction in this test comparison; see the comments on this post below.
I did some research. So far I understand that the modpow algorithm uses exponentiation by squaring, integer multiplication and modulo reduction - these all sound very familiar to me. Several helper methods could improve the running time of integer multiplication:
Montgomery multiplication
FFT multiplication
To improve the running time of the exponentiation-by-squaring part, one may use a signed-digit representation to reduce the number of multiplications (i.e. bits are represented as 0, 1 or -1, and the bit string is rewritten so that it contains many more zeros than its original base-2 representation; this reduces the number of multiplications in exponentiation by squaring). A sketch of the basic square-and-multiply loop that all these techniques build on is shown after the next list.
For optimizing the modulo part of the operation, I know of these methods:
Montgomery reduction
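To make the skeleton concrete, here is a minimal left-to-right square-and-multiply modpow, sketched in Julia with BigInt for brevity (an illustration, not a library-grade routine; the techniques above - Montgomery reduction, FFT multiplication, signed digits - all plug into this loop, and I believe Julia's own powermod(a, e, n) does the same job via GMP):

# Left-to-right square-and-multiply: one modular squaring per exponent bit,
# plus one modular multiply for every bit that is set.
function modpow(a::BigInt, e::BigInt, n::BigInt)
    r = big(1)
    for i in ndigits(e, base=2)-1:-1:0
        r = mod(r * r, n)        # square
        if (e >> i) & 1 == 1
            r = mod(r * a, n)    # multiply when the exponent bit is set
        end
    end
    return r
end

The signed-digit trick mentioned above attacks the multiply-when-bit-set step: with digits in {-1, 0, 1} there are fewer non-zero digits, hence fewer multiplies, at the cost of also needing a^-1 mod n.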
So here is the 150,000 dollar question: is there a software library available that does a modpow operation efficiently for a very large base, exponent and modulus? (I'm aiming for several million digits.) If you would like to suggest an option, please try to explain the inner workings of the algorithm for the case with millions of digits for the base, modulus and exponent, as some libraries use different algorithms based on the number of digits. Basically I am looking for a library which supports the techniques mentioned above (or possibly more clever ones) and performs well while running the algorithm (better than GMP, at least). So far I've searched for, found and tried GMP and PFGW, but found neither satisfying (PFGW is fast, but I'm just interested in the modpow operation, and there is no direct programming interface to it). I'm hoping that an expert in the field can suggest a library with these capabilities, as there seem to be very few that are able to handle these requirements.
Edit: made the question more concise, as it was marked too broad.
First off, re. the Answer 1 writer's comment "I do not use GMP but I suspect when they wrote they use FFT they really mean the NTT" - no, when GMP says "FFT" it means a floating-point FFT. IIRC they also have some NTT-based routines, but for bignum mul those are uncompetitive with FFT.
The reason a well-tuned FFT-mul beats any NTT is that the slight loss of per-word precision due to roundoff error accumulation is more than made up for by the vastly superior floating-point capabilities of modern CPU offerings, especially when one considers high-performance implementations which make use of the vector-math capabilities of CPUs such as the x86_64 family, the current iterations of which - Intel Haswell, Broadwell and Skylake - have massive vector floating-point capability. (I don't cite AMD in this regard because their AVX offerings have lagged far behind Intel's; their high-water mark was circa 2002 and since then Intel has been beating the pants off them in progressively-worse fashion each year.) The reason GMP disappoints in this area is that GMP's FFT is, relatively speaking, crap. I have great respect for the GMP coders overall, but FFT timings are FFT timings, you don't get points for effort or e.g. having a really good bignum add. Here is a paper detailing a raft of GMP FFT-mul improvements:
Pierrick Gaudry, Alex Kruppa, Paul Zimmerman: "A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm" [http://www.loria.fr/~gaudry/publis/issac07.pdf]
This is from 2007, but AFAIK the performance gap noted in the snippet below has not been narrowed; if anything it has widened. The paper is excellent for detailing various mathematical and algorithmic improvements which can be deployed, but let's cut to the money quote:
"A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^2wn − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our im- plementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron."
The next few paragraphs are a raft of face-saving spin. (And again, I am personally acquainted with 2 of the 3 authors, and they are all top guys in the field of computational number theory.)
Note that the aforementioned George Woltman, whose Prime95 code has discovered all of the world-record primes since shortly after its debut 20 years ago, has made his core bignum routines available in a general API-ized form called the GWNUM library. You mentioned how much faster PFGW is than GMP for FFT-mul - that's because PFGW uses GWNUM for the core 'heavy lifting' arithmetic, that's where the 'GW' in PFGW comes from.
My own FFT implementation, which has generic-C build support but like George's uses reams of x86 vector-math assembler for high performance on that CPU family, is roughly 60-70% slower than George's on current Intel processor families. I believe that makes it the world's 2nd-fastest bignum-mul code on x86. By way of example, my code is currently running a primality test on a number with roughly 2^29 bits using a 30-Mdouble-length FFT (30*2^20 doubles); thus a little more than 17 bits per input word. Using all four of my 3.3 GHz Haswell 4670 quad's cores it takes ~90 ms per modmul.
BTW, many (if not most) of the world's top bignum-math coders hang out at mersenneforum.org; I encourage you to check it out and ask your questions to the broader (at least in this particular area) expert audience there. I appear under the same handle there as here; George Woltman appears as "Prime95", PFGW's Mark Rodenkirch goes as "rogue".
I do not use GMP at all so handle this with that in mind.
I would rather use NTT instead of FFT for multiplication:
it removes the rounding errors and, in comparison to my FFT implementations optimized to the same point, it is faster.
C++ NTT implementation
C++ NTT sqr(mul) implementation
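For readers who have not met it: the NTT is just the FFT carried out over a prime field instead of the complex numbers, so all arithmetic is exact and there are no rounding errors to manage. A minimal recursive radix-2 sketch, in Julia for brevity (p = 2013265921 = 15*2^27 + 1 is a common NTT prime with primitive root 31; a tuned implementation would be iterative, cache-aware and vectorized):

# Radix-2 number-theoretic transform; length(a) must be a power of 2 and
# w must be a primitive length(a)-th root of unity mod p, e.g.
# w = powermod(31, (p - 1) ÷ length(a), p) for p = 2013265921.
function ntt(a::Vector{Int64}, w::Int64, p::Int64)
    n = length(a)
    n == 1 && return a
    even = ntt(a[1:2:end], powermod(w, 2, p), p)  # transform of even-indexed terms
    odd  = ntt(a[2:2:end], powermod(w, 2, p), p)  # transform of odd-indexed terms
    half = n ÷ 2
    out = similar(a)
    wk = Int64(1)
    for k in 1:half  # butterfly combine; all arithmetic mod p, hence exact
        t = (wk * odd[k]) % p
        out[k]        = (even[k] + t) % p
        out[k + half] = (even[k] - t + p) % p
        wk = (wk * w) % p
    end
    return out
end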
As I mentioned, I do not use GMP, but I suspect that when they write "FFT" they really mean the NTT (finite-field Fourier transform).
The speed difference between your test and the GMP primality test can be caused by how the modpow call is used.
If there are too many calls to it, that causes heap/stack thrashing, which slows things down considerably, especially for bignums. Try to avoid heap thrashing, so eliminate as much data from operands and return values as possible for frequently called functions. It also sometimes helps to eliminate a bottleneck call by copying the function's source code directly into your code instead of calling it (or by using macros instead), with use of local variables only.
GMP publishes its source code, so finding its implementation of modpow should not be too hard. You just have to use it correctly.
Just to be clear:
You are using big numbers like 20,000+ decimal digits, which means ~8.4 KB per number. Any return value or non-pointer operand means copying that amount of data to/from the heap/stack. This not only takes time but also usually invalidates the CPU's cache, which also kills performance.
Now multiply this by the number of iterations of the algorithm and you get the idea. I had similar problems while tweaking many of my bignum functions, and the speedup was often more than 10000% (100 times) even with no change in the algorithm used, just from limiting/eliminating heap/stack thrashing.
So I do not think you need a better modpow implementation, just better usage of it. Of course I may be wrong, but without the code you are using it is hard to deduce more.

Compiled Simulink/Matlab x Fortran - Performance

I have to prove to my client that Fortran is faster than MATLAB/Simulink. He is considering migrating code from Fortran to MATLAB. The code is mainly logic and "procedural" subroutines. It does not use any native matrix operations or mathematical functions (eigenvalues, non-linear equations, etc.).
I think the question of which is faster is already answered by several references on the internet and by the "intrinsic characteristics" of each language, but I need concrete data.
All the charts I found compare MATLAB/Simulink vs Fortran but do not specify whether the MATLAB code is compiled or not (using the MATLAB Coder toolbox). I think that is a critical issue.
I'm not saying that compiling the code will make MATLAB faster than Fortran, but in order to really convince someone I would like to see the results.
A good start would be:
Performance - MATLAB (.m) compiled (MATLAB Coder toolbox) vs Intel Fortran
Performance - Simulink compiled (Realtime toolbox) vs Intel Fortran
Has anyone already tested this scenario?
Matlab code that I recently "compiled" using the Matlab Coder produced a speed-up of x20 (!). The actual expected speedup depends on many things. If your Matlab code is highly vectorized and uses mainly linear-algebra routines, then the Coder is unlikely to produce much speedup. But if you have multiple loops and conditionals in your algorithm then you can indeed achieve order-of-magnitude speedup as in my example above.
Under the hood, Matlab's linear-algebra uses BLAS/LAPACK (via the MKL/ACML libraries), that use highly-optimized Fortran code. So unless you write extremely efficient Fortran, it is not likely that you will be able to outperform Matlab (despite the function-call overheads) for highly-vectorized Matlab linear-algebra/math algos. However, if your code uses conditionals/loops and similar non-math programming constructs, then the picture might change. In short, there's no simple answer - it depends on your specific algorithm/program.
Putting performance aside for the moment, Matlab has numerous other benefits over Fortran, including a vast array of tested built-in functions and enabling a rapid development cycle.
You would need to ask a more tightly defined question - there's no single answer to whether Fortran is faster than MATLAB/Simulink.
First of all, it's easy to write terrible, slow algorithms in either language. So you'd need to specify particular, well-written algorithms.
Secondly, there are many things for which MATLAB will be faster than even very well-written Fortran (or C). For example, if you want to multiply two big matrices together, or calculate some eigenvalues, or other linear algebra that is in MATLAB's sweet spot, you won't beat it. On the other hand if you're doing something with a lot more logic, that can't be vectorised, Fortran is likely to be faster (as long as it's written well).
When you introduce MATLAB Coder into the picture, these latter things are the ones that are most likely to benefit from a speedup by converting to C code (mostly because the former things really can't be sped up much, which is why you wouldn't beat them). But the speedup is variable - I've seen over 10-15x, but also sometimes only 1-2x.
You don't mention where you found the charts you have comparing MATLAB to Fortran, but if you've found them on the internet I would think it's a pretty safe assumption that they don't involve C code generation with MATLAB Coder, and represent the performance of just MATLAB.
Finally - one other method of speeding up MATLAB is to parallelize it with Parallel Computing Toolbox (which enables you to parallelize things over the cores on your local machine) and possibly also with Distributed Computing Server (parallelization on cluster). It's typically a lot easier to do this with MATLAB code than it is to speed up by using MATLAB Coder to produce C code - so if you think it's critical to consider MATLAB Coder in your comparisons, you should probably also consider this as well.
MATLAB Compiler will not make your code faster; it is intended for distributing your code to third-party users who do not have MATLAB. You need to provide, along with your compiled code, the MCR or MATLAB Compiler Runtime, which is essentially a headless version of MATLAB, and which you can distribute freely if you have a license for MATLAB Compiler.
Now, if you use MATLAB Coder (or Simulink Coder for Simulink) to generate C code from your MATLAB code, then it is likely that you will get a speed up compared to interpreted MATLAB code. Even then, that depends on the code in question. Also, this only supports a subset of the MATLAB language, that is compatible with C code generation.

Does an ODE solver written using GNU GSL outperform Mathematica's NDSolve?

Would an ODE solver written in C, perhaps using the GSL library, have significant speed advantages compared with Mathematica 8.0's NDSolve? How would it fare in terms of accuracy?
My understanding is that compiled code could in principle be faster, but that these days NDSolve uses a lot of compiled code itself already somehow?
Also, are there any options for using things like MathLink or Mathematica's Compile function to speed up solving an ODE?
NDSolve and other numerical functions in Mathematica automatically compile your operand (e.g. the RHS of an ODE) to an intermediate "bytecode" language (the same one used by the Compile function). If you like, you can specify CompilationTarget -> "C" and the function will be compiled all the way to C code and linked back into Mathematica... You can see the generated C code yourself in this previous question on the Mathematica Stack Exchange:
https://mathematica.stackexchange.com/questions/821/how-well-does-mathematica-code-exported-to-c-compare-to-code-directly-written-fo/830#830
Of course, it's always possible in principle to hand-write a faster algorithm... But there are a lot of things to optimize that Mathematica will do automatically. You probably don't want to be responsible for manually optimizing the computation of a sparse matrix of partial derivatives in an optimization problem for example.
Mathematica's focus is on usability. They do use numerical libraries, so the speed would be the same as the best available library, or worse (in almost all cases). For example, I heard they use Eigen for matrix operations.
The other thing you should consider is that although they optimize the functions they provide, your own functions are not optimized; so the derivative that you calculate at each step would be faster in C.
To my friends deciding between Mathematica and C++, I say to go with Mathematica, since they should focus on getting results fast rather than on building the fastest code.
