Would an ODE solver written in C, perhaps using the GSL library, have significant speed advantages over Mathematica 8.0's NDSolve? And how would it fare in terms of accuracy?
My understanding is that compiled code could in principle be faster, but that these days NDSolve already uses a good deal of compiled code internally.
Also, are there any options, such as MathLink or Mathematica's Compile function, for speeding up the solution of an ODE?
NDSolve and other numerical functions in Mathematica automatically compile your operand (e.g. the RHS of an ODE) to an intermediate "bytecode" language (the same one used by the Compile function). If you like, you can specify CompilationTarget -> "C" and the function will be compiled all the way to C code and linked back into Mathematica. You can see the generated C code yourself in this previous question on the Mathematica Stack Exchange:
https://mathematica.stackexchange.com/questions/821/how-well-does-mathematica-code-exported-to-c-compare-to-code-directly-written-fo/830#830
Of course, it's always possible in principle to hand-write a faster algorithm, but there is a lot that Mathematica will optimize for you automatically. You probably don't want to be responsible for manually optimizing the computation of a sparse matrix of partial derivatives in an optimization problem, for example.
Mathematica's focus is on usability. It does use numerical libraries under the hood, so its speed will at best match that of the best available library, and in almost all cases be somewhat worse. For example, I have heard it uses Eigen for some matrix operations.
The other thing to consider is that while the functions Mathematica provides are optimized, your own functions are not. So the derivative that you compute at each step would be faster in C.
To friends deciding between Mathematica and C++, I suggest going with Mathematica, since they should focus on getting results quickly rather than on building the fastest possible code.
I have some difficulty understanding how performance in non-linear optimisation is influenced by the specific way the solver engine is interfaced.
We have an optimisation model that, in its first version, was written in GAMS.
IPOPT (a common FOSS non-linear solver engine) was reporting, for each optimisation, an execution time of 1.4 CPU seconds in IPOPT itself (without function evaluations) and 0.2 CPU seconds in function evaluations.
When we converted the model to C++ (for a better accounting of the non-optimisation components of the model) and interfaced IPOPT through its C++ API (using ADOL-C and ColPack for automatic differentiation), we got execution times of 0.7 seconds in IPOPT and 9.4 seconds in function evaluations (the improvement in IPOPT is likely due to the fact that, compiling IPOPT from source, we were able to use better linear solvers not available in the GAMS version of IPOPT).
So using C++, admittedly with badly optimised code, gave us function evaluations ~50 times slower than in GAMS, partially compensated by the better solver time.
We are now evaluating the feasibility of converting the model to other languages, either Python with Pyomo or Julia with JuMP.
But first we would like to understand how the function evaluations made by the solver at each step depend on the specific language of the implementation.
With C++, it's pretty evident that the functions making up the optimisation model are directly executed (evaluated) at each iteration, so the way they are implemented does matter (in particular, the gradient and Hessian are recomputed each time, at least in our implementation).
How is it with Pyomo and JuMP? Would each iteration be evaluated in Python or Julia, or would Pyomo and JuMP instead first render the model in (I guess) C, derive the gradient and Hessian once and for all, and then have this "C version" evaluated at each iteration?
It would clearly make a big difference, especially for Python.
Pyomo interfaces to Ipopt by converting the model to the NL file format. It assumes the "ipopt" executable is in your PATH (Ipopt compiled with ASL). All function evaluations that take place during optimization happen in C within the Ampl Solver Library.
JuMP has compared favorably with GAMS in our own benchmarks; take that as you will. The derivative computations are entirely in Julia (which is fast); no compiled C code is involved.
I have to prove to my client that Fortran is faster than Matlab/Simulink. He is considering migrating code from Fortran to Matlab. The code is mainly logic and "procedural" subroutines; it does not use any native matrix operations or mathematical functions (eigenvalues, non-linear equations, etc.).
I think the question of which is faster is already answered, considering several references on the internet and the "intrinsic characteristics" of each language, but I need concrete data.
All the charts I found compare Matlab/Simulink vs Fortran but do not specify whether the Matlab code is compiled or not (using the Matlab Coder toolbox). I think that is a critical issue.
I'm not saying that compiling the code will make Matlab faster than Fortran, but in order to really convince someone I would like to see the results.
A good start would be:
Performance: Matlab (.m) compiled (Matlab Coder toolbox) vs Intel Fortran
Performance: Simulink compiled (Real-Time toolbox) vs Intel Fortran
Has anyone already tested this scenario?
Matlab code that I recently "compiled" using the Matlab Coder produced a speed-up of x20 (!). The actual expected speedup depends on many things. If your Matlab code is highly vectorized and uses mainly linear-algebra routines, then the Coder is unlikely to produce much speedup. But if you have multiple loops and conditionals in your algorithm then you can indeed achieve order-of-magnitude speedup as in my example above.
Under the hood, Matlab's linear algebra uses BLAS/LAPACK (via the MKL/ACML libraries), which are highly optimized Fortran code. So unless you write extremely efficient Fortran, you are not likely to outperform Matlab (despite the function-call overheads) on highly vectorized linear-algebra/math algorithms. However, if your code uses conditionals/loops and similar non-math programming constructs, then the picture might change. In short, there's no simple answer - it depends on your specific algorithm/program.
Putting performance aside for the moment, Matlab has numerous other benefits over Fortran, including a vast array of tested built-in functions and enabling a rapid development cycle.
You would need to ask a more tightly defined question - there's no single answer to whether Fortran is faster than MATLAB/Simulink.
First of all, it's easy to write terrible, slow algorithms in either language. So you'd need to specify particular, well-written algorithms.
Secondly, there are many things for which MATLAB will be faster than even very well-written Fortran (or C). For example, if you want to multiply two big matrices together, or calculate some eigenvalues, or other linear algebra that is in MATLAB's sweet spot, you won't beat it. On the other hand if you're doing something with a lot more logic, that can't be vectorised, Fortran is likely to be faster (as long as it's written well).
When you introduce MATLAB Coder into the picture, these latter things are the ones that are most likely to benefit from a speedup by converting to C code (mostly because the former things really can't be sped up much, which is why you wouldn't beat them). But the speedup is variable - I've seen over 10-15x, but also sometimes only 1-2x.
You don't mention where you found the charts you have comparing MATLAB to Fortran, but if you've found them on the internet I would think it's a pretty safe assumption that they don't involve C code generation with MATLAB Coder, and represent the performance of just MATLAB.
Finally - one other method of speeding up MATLAB is to parallelize it with Parallel Computing Toolbox (which enables you to parallelize things over the cores of your local machine) and possibly also with MATLAB Distributed Computing Server (parallelization on a cluster). It's typically a lot easier to do this with MATLAB code than it is to obtain a speed-up by using MATLAB Coder to produce C code - so if you think it's critical to consider MATLAB Coder in your comparisons, you should probably also consider this.
MATLAB Compiler will not make your code faster; it is intended for distributing your code to third-party users who do not have MATLAB. You need to provide, along with your compiled code, the MCR or MATLAB Component Runtime, which is essentially a headless version of MATLAB, and which you can distribute freely if you have a license for MATLAB Compiler.
Now, if you use MATLAB Coder (or Simulink Coder for Simulink) to generate C code from your MATLAB code, then it is likely that you will get a speed-up compared to interpreted MATLAB code. Even then, it depends on the code in question. Also, Coder only supports a subset of the MATLAB language, the part that is compatible with C code generation.
I'm new to programming and am learning Haskell by reading and working through Project Euler problems. Of course, the most important thing one can do to improve performance on these problems is to use a better algorithm. However, it is clear to me that there are other simple, easy-to-implement ways to improve performance. A cursory search brought up this question, and this question, which give the following tips:
Use the GHC flags -O2 and -fllvm.
Use the type Int instead of Integer, because it is a fixed-width machine integer that GHC can unbox (or Int64 instead of Integer when you need a guaranteed 64-bit range). This requires giving your functions explicit type signatures, rather than letting the compiler decide on the fly.
Use rem, not mod, for divisibility testing (see the sketch after this list).
Use Schwartzian transformations when appropriate.
Use an accumulator in recursive functions (to enable tail-recursion optimization, I believe).
Memoization (?)
(One answer also mentions worker/wrapper transformation, but that seems fairly advanced.)
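For instance, a minimal sketch of what tips 2 and 3 look like in practice (the function and its use on Project Euler problem 1 are my own illustration, not from the linked answers):

-- An explicit Int signature: without it, defaulting would give us Integer.
-- rem, not mod, is used for the divisibility tests.
multiplesOf3Or5Below :: Int -> Int
multiplesOf3Or5Below n =
  sum [x | x <- [1 .. n - 1], x `rem` 3 == 0 || x `rem` 5 == 0]

main :: IO ()
main = print (multiplesOf3Or5Below 1000)  -- Project Euler problem 1: 233168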
Question: What other simple optimizations can one make in Haskell to improve performance on Project Euler-style problems? Are there any other Haskell-specific (or functional programming specific?) ideas or features that could be used to help speed up solutions to Project Euler problems? Conversely, what should one watch out for? What are some common yet inefficient things to be avoided?
Here are some good slides by Johan Tibell that I frequently refer to:
Haskell Performance Patterns
One easy suggestion is to use hlint, a program that checks your source code and makes syntax-level suggestions for improvement. This might not increase speed, since most of these rewrites are likely already done by the compiler or by lazy evaluation, but it might help the compiler in some cases. Furthermore, it will make you a better Haskell programmer, since you will learn better ways to do things, and it may become easier to understand and analyze your program.
Examples taken from http://community.haskell.org/~ndm/darcs/hlint/hlint.htm, such as:
darcs-2.1.2\src\CommandLine.lhs:94:1: Error: Use concatMap
Found:
concat $ map escapeC s
Why not:
concatMap escapeC s
and
darcs-2.1.2\src\Darcs\Patch\Test.lhs:306:1: Error: Use a more efficient monadic variant
Found:
mapM (delete_line (fn2fp f) line) old
Why not:
mapM_ (delete_line (fn2fp f) line) old
I think the largest gains you can make on Project Euler problems come from understanding the problem and removing unnecessary computations. Even without understanding everything, you can make some small fixes that will make your program run twice as fast. Say you are looking for primes up to 1,000,000. You could of course do filter isPrime [1..1000000], but if you think a bit you can realize that no even number above 2 is prime, so you have removed (about) half the work. Instead do 2 : filter isPrime [3,5..999999].
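To make that concrete, here is a minimal runnable sketch (the trial-division isPrime is my own illustration, not part of the original suggestion):

-- Trial division by candidates up to sqrt n; simple, not optimal.
isPrime :: Int -> Bool
isPrime n = n >= 2 && all (\d -> n `rem` d /= 0) candidates
  where candidates = takeWhile (\d -> d * d <= n) (2 : [3, 5 ..])

-- 2 is the only even prime, so test odd numbers only.
primesUpTo :: Int -> [Int]
primesUpTo limit = 2 : filter isPrime [3, 5 .. limit]

main :: IO ()
main = print (length (primesUpTo 1000000))  -- prints 78498

Halving the candidate list roughly halves the filtering work, before any cleverness about the primality test itself.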
There is a fairly large section of the Haskell wiki about performance.
One fairly common problem is too little (or too much) strictness (this is covered by the sections listed under General techniques on the performance page above). Too much laziness causes a large number of thunks to accumulate; too much strictness can cause too much to be evaluated.
These considerations are especially important when writing tail-recursive functions (i.e. those with an accumulator). And on that note, depending on how the function is used, a tail-recursive function is sometimes less efficient in Haskell than the equivalent non-tail-recursive function, even with the optimal strictness annotations.
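As a hedged illustration of the strictness point (my own sketch, not taken from the wiki pages):

{-# LANGUAGE BangPatterns #-}

-- Lazy accumulator: go builds a chain of a million (+) thunks before
-- anything is evaluated, which can overflow the stack on large inputs.
-- (GHC's strictness analysis often rescues this at -O2, but don't rely on it.)
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc []       = acc
    go acc (x : xs) = go (acc + x) xs

-- Strict accumulator: the bang pattern forces acc at every step,
-- so the loop runs in constant space.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs

main :: IO ()
main = print (sumStrict [1 .. 1000000])  -- prints 500000500000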
Also, as demonstrated by this recent question, sharing can make a huge difference to performance (in many cases, this can be considered a form of memoisation).
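For example (my own sketch, not the code from the linked question), the classic lazy-list Fibonacci uses sharing as memoisation:

-- Naive recursion: fib (n-1) and fib (n-2) are recomputed from scratch,
-- so the running time grows exponentially in n.
fibNaive :: Int -> Integer
fibNaive n
  | n < 2     = fromIntegral n
  | otherwise = fibNaive (n - 1) + fibNaive (n - 2)

-- Shared version: the list refers back to itself, so each entry is
-- computed exactly once and then reused.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

main :: IO ()
main = print (fibs !! 100)  -- instant; fibNaive 100 would effectively never finish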
Project Euler is mostly about finding clever algorithmic solutions to the problems. Once you have the right algorithm, micro-optimization is rarely an issue, since even a straightforward or interpreted (e.g. Python or Ruby) implementation should run well within the speed constraints. The main technique you need is understanding lazy evaluation so you can avoid thunk buildups.
Here is a good linear solver named GotoBLAS. It is available for download and runs on most computing platforms. My question is: is there an easy way to link this solver with the Mathematica kernel, so that we can call it like LinearSolve? One thing most of you will surely agree on is that if we have a very large linear system, we had better have it solved by some industry-standard linear solver; the built-in solver is not meant for really large problems.
Now that Mathematica 8 has come out with better compilation and library-link capabilities, we can expect to use some of those solvers from within Mathematica. The question is whether that requires only a little tuning of the source code, or whether you need to be an advanced wizard to do it. Here in this forum we could start linking excellent open-source programs like GotoBLAS with Mathematica and exchange our views. Less experienced people can get some insight from the pro users, and in the end we get a much stronger Mathematica. It would be an open project for the ever-increasing Mathematica community and a platform where these newly introduced capabilities of Mathematica 8 could be transparently documented for future users.
I hope some of you here will give solid ideas on how we can get GotoBLAS running from within Mathematica. As the newer compilation and library-link capabilities are usually not very well documented, they are not often used by ordinary users. This question can act as a toy example through which to document these new capabilities of Mathematica. Help in this direction from the experienced forum members will really lift the motivation of new users like me, as well as teach us a very useful way to extend Mathematica's number-crunching arsenal.
The short answer, I think, is that this is not something you really want to do.
GotoBLAS, as I understand it, is a specific implementation of BLAS, which stands for Basic Linear Algebra Subprograms. "Basic" really does mean quite basic here - multiply a matrix by a vector, for example. Thus, BLAS is not a solver that a function like LinearSolve would call. LinearSolve would (depending on the exact form of the arguments) call a LAPACK routine, LAPACK being a higher-level package built on top of BLAS. Thus, to really link GotoBLAS (or any BLAS) into Mathematica, one would need to recompile the whole kernel.
Of course, one could write a C/Fortran program that was compiled against GotoBLAS and then link that into Mathematica. The resulting program would only use GotoBLAS when running whatever specific commands you've linked into Mathematica, however, which rather misses the whole point of BLAS.
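For a taste of what the host-language side of such a link looks like, here is a minimal sketch - in Haskell, only because it has a convenient C FFI; a LibraryLink version would follow the same shape. It assumes a CBLAS-style library such as GotoBLAS or OpenBLAS to link against (e.g. with -lopenblas):

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble, CInt)
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr)

-- Bind the CBLAS dot product:
-- double cblas_ddot(int n, const double *x, int incx, const double *y, int incy)
foreign import ccall unsafe "cblas_ddot"
  c_ddot :: CInt -> Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> IO CDouble

dot :: [Double] -> [Double] -> IO Double
dot xs ys =
  withArray (map realToFrac xs) $ \px ->
    withArray (map realToFrac ys) $ \py ->
      realToFrac <$> c_ddot (fromIntegral (length xs)) px 1 py 1

main :: IO ()
main = dot [1, 2, 3] [4, 5, 6] >>= print  -- 32.0

And that is exactly the limitation: only calls routed through such a binding would use GotoBLAS; everything else the host does remains untouched.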
The Wolfram Kernel is already linked against the highly optimized Intel Math Kernel Library, which is distributed with Mathematica. The MKL is multithreaded and vectorized, so I'm not sure what GotoBLAS would improve upon.
This should be very simple but I could not find an exhaustive answer:
I need to perform C = A + B with matrices, where A and B are two matrices of unknown size (they could be anywhere from 2x2 up to 20,000x20,000).
Should I use CUBLAS with the Sgemm function to calculate it?
I need the maximum speed achievable, so I thought of the CUBLAS library, which should be well optimized.
For any sort of technical computing, you should always use optimized libraries when available. Existing libraries, used by hundreds of other people, are going to be better tested and better optimized than anything you do yourself, and the time you don't spend writing (and debugging, and optimizing) that function yourself can be better spent working on the actual high-level problem you want to solve instead of re-discovering things other people have already implemented. This is just basic specialization of labour stuff; focus on the compute problem you want to solve, and let people who spend their days professionally writing GPGPU matrix routines do that for you.
Only when you are sure that existing libraries don't do what you need -- maybe they solve too general a problem, or make certain assumptions that don't hold in your case -- should you roll your own.
I agree with the others that in this particular case the operation is pretty straightforward and it's feasible to DIY; but if you're going to be doing anything else with those matrices once you're done adding them, you'd be best off using optimized BLAS routines for whatever platform you're on.
What you want to do would be trivial to implement in CUDA and will be bandwidth limited.
And since CUBLAS 5.0, cublas<t>geam (e.g. cublasSgeam for single precision) can be used for that. It computes the weighted sum of two optionally transposed matrices.