Julia - Parallel mathematical optimizers - parallel-processing

Im using GLPK with through Julia, and I need to optimize the same GLPK.Prob repeatedly.
What changes between each optimization is that some combination of variables gets fixed to 0
simply put in pseudocode
lp = GLPK.Prob()
variables_to_block = [[1,2,3], [2,3], [6,7], [19,100,111]...]
for i in variables_to_block
block_vars(lp, i)
simplex(lp)
restore_vars(lp, i)
end
when I then run this, it looks like CPU1 acts as a scheduler, staying in the 9-11% range, and the load on CPU3 and CPU4 alternates between 0 and 100%, though never at the same time...the load on CPU2 stays at 0%
This can take a bit of time and I'd like to use all cores
There is however, a bit of a hassle to use the parallel features of Julia, particularly for lp-models because they involve pointers so (as far as I know) they cannot be copied easily between cores
Is there a way to either set the GLPK solver binary (or something) to automatically try to fully utilize all cores? either by compiling GLPK in such a way, or any other way

As far as I know, GLPK is not multithreaded. If you must have a multithreaded solver, then consider using a newer one such as Gurobi or MOSEK. They have free academic licenses.
If commercial solvers are anathema to your purposes, then perhaps try the Splitting Cone Solver. It is released for free under the MIT License.
Otherwise, the suggestions in the comments (e.g. batch proessing) may be your only recourse.

Related

ode45 mex file runs slow

This is the function trial2 which creates the system of differential equations:
function xdot=trial2(t,x)
delta=0.1045;epsilon=0.0048685;
xdot=[x(2);(-delta-epsilon*cos(t))*x(1)-0.7*delta*abs(x(1))];
Then, I try and solve this using ode45:
[t,x]=ode45('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But, this takes approximately 15 minutes. So I tried to improve the run time by creating a mex file. I found an ode45 mex equivalent function ode45eq.c. Then, I tried to solve by:
mex ode45eq.c;
[t,x]=ode45eq('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this seems to run even slower. What could be the reason and how can I improve the run time to 2–3 minutes? I have to perform a lot of these computations. Also, is it worth creating a standalone C++ file to solve the problem faster?
Also, I am running this on an Intel i5 processor 64-bit system with 8 gb RAM. How much gain in speed do you think I can get if I move to a better processor with say 16 gb RAM?
With current versions of Matlab, it's very unlikely that you'll see any performance improvement by compiling ode45 to C/C++ mex. In fact, the compiled version will almost certainly be slower as you found. There's a good reason that ode45 is written in pure Matlab, as opposed to being compiled down to a native C function: it has to call user functions written in Matlab on every iteration. Additionally, Matlab's ode45 is a very dynamic function that is capable of interacting with the Matlab environment in many ways during the course of integration (plotting output functions, event detection, interpolation, etc.). It's also probably more straightforward to safely handle dynamic memory allocation in Matlab, than in C.
Your C code calls your user function via mexCallMATLAB. This function is not really meant for repeated calls, especially if they transfer data back and forth. Doing what you're trying to do would likely require new mex APIs and possibly changes to the Matlab language.
If you want faster numerical integration, you're going to have to give up the convenience of being able to write your integrations functions (i.e., trial2 in your example) in Matlab. You'll need to "hard code" your integration functions and compile them along with the integration scheme itself. With a detailed knowledge about your problem, and decent programming skills, you can write a tight integration loop and it may be possible to achieve an order of magnitude speedup in some cases.
Lastly, your trial2 function has an absolute value as well as an oscillating trigonometric function in it. Is this differential equation stiff? Have you tried other solvers, e.g., ode15s? Compare the outputs even over a shorter time period. And you may find that you get a bit of a speed-up (~25% on my machine) if you use the modern way of passing function handles instead of strings to ode45:
[t,x] = ode45(#trial2,[0 10000000],[0;1]);
The trial2 function can still be in a separate M-file or it can be a sub-function in the same file with your call to ode45 (this file need to be a function file, not a script of course).
What you did is basically replacing the todays implementation with the implementation of 1993, you have shown that mathworks did a great job improving the performance of the ode45 solver.
I don't see any possibility to improve the performance in this case, you can assume that such a fundamental part of MATLAB like ode45 is implemented in a optimal way, replacing it with some other code to mex it is not the solution. Everything you could gain using a mex function is cutting of the overhead of input/output handling which is implemented in m. Probably less than 0.1s execution time.

Compiled Simulink/Matlab x Fortran - Performance

I have to prove to my client that Fortran is faster than Matlab/Simulink. He is considering migrating a code from fortran to Matlab. The code is mainly logic and "procedural" subroutines. It does not use any native matrix operations or mathematical functions (eigenvalues, non linear equations, etc)
I think that the question of who is faster is already answered considering several references over the internet and the "intrinsic characteristics" of each language, but I need concrete data.
All charts that I found compare Matlab/Simulink x Fortran but do not specify if the Matlab code is compiled or not (using matlab coder toolbox). I think that it is a critical issue.
I´m not saying that compiling the code will make matlab faster than fortran, but in order to really convince someone I would like to see the results.
A good start would be:
Performance - Matlab (.m) compiled (Matlab coder toolbox) X Intel Fortran
Performance - Simulink compiled (Realtime toolbox) X Intel Fortran
Does anyone have already tested this scenario?
Matlab code that I recently "compiled" using the Matlab Coder produced a speed-up of x20 (!). The actual expected speedup depends on many things. If your Matlab code is highly vectorized and uses mainly linear-algebra routines, then the Coder is unlikely to produce much speedup. But if you have multiple loops and conditionals in your algorithm then you can indeed achieve order-of-magnitude speedup as in my example above.
Under the hood, Matlab's linear-algebra uses BLAS/LAPACK (via the MKL/ACML libraries), that use highly-optimized Fortran code. So unless you write extremely efficient Fortran, it is not likely that you will be able to outperform Matlab (despite the function-call overheads) for highly-vectorized Matlab linear-algebra/math algos. However, if your code uses conditionals/loops and similar non-math programming constructs, then the picture might change. In short, there's no simple answer - it depends on your specific algorithm/program.
Putting performance aside for the moment, Matlab has numerous other benefits over Fortran, including a vast array of tested built-in functions and enabling a rapid development cycle.
You would need to ask a more tightly defined question - there's no single answer to whether Fortran is faster than MATLAB/Simulink.
First of all, it's easy to write terrible, slow algorithms in either language. So you'd need to specify particular, well-written algorithms.
Secondly, there are many things for which MATLAB will be faster than even very well-written Fortran (or C). For example, if you want to multiply two big matrices together, or calculate some eigenvalues, or other linear algebra that is in MATLAB's sweet spot, you won't beat it. On the other hand if you're doing something with a lot more logic, that can't be vectorised, Fortran is likely to be faster (as long as it's written well).
When you introduce MATLAB Coder into the picture, these latter things are the ones that are most likely to benefit from a speedup by converting to C code (mostly because the former things really can't be sped up much, which is why you wouldn't beat them). But the speedup is variable - I've seen over 10-15x, but also sometimes only 1-2x.
You don't mention where you found the charts you have comparing MATLAB to Fortran, but if you've found them on the internet I would think it's a pretty safe assumption that they don't involve C code generation with MATLAB Coder, and represent the performance of just MATLAB.
Finally - one other method of speeding up MATLAB is to parallelize it with Parallel Computing Toolbox (which enables you to parallelize things over the cores on your local machine) and possibly also with Distributed Computing Server (parallelization on cluster). It's typically a lot easier to do this with MATLAB code than it is to speed up by using MATLAB Coder to produce C code - so if you think it's critical to consider MATLAB Coder in your comparisons, you should probably also consider this as well.
MATLAB Compiler will not make your code faster, it is intended for distributing your code to third party users that do not have MATLAB. You need to provide, along with your compiled code, the MCR or MATLAB Component Runtime, which is essentially a headless version of MATLAB, and which you can distribute freely if you have a license of MATLAB Compiler.
Now, if you use MATLAB Coder (or Simulink Coder for Simulink) to generate C code from your MATLAB code, then it is likely that you will get a speed up compared to interpreted MATLAB code. Even then, that depends on the code in question. Also, this only supports a subset of the MATLAB language, that is compatible with C code generation.

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model and programming in general, I suppose. I'm attempting to parallelize an expectation maximization algorithm. I am working on a gtx 480 which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too much specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense so please ask if you need clarification.
Thank you!
As indicated by #JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism in a device that does not support it (ie. pre cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to realize some unit of your code as parallelizable, convert it to a kernel, and work through your code in essentially a non-nested fashion. Repetetive functions might be done via looping in the kernel, or else looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
If it doesn't work, and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what this architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also check OpenMP and MPI, these are the languages used for task-parallel programming (actually OpenMP is just a handful of pragmas in most cases).

Fastest math programming language?

I have an application that requires millions of subtractions and remainders, i originally programmed this algorithm inside of C#.Net but it takes five minutes to process this information and i need it faster than that.
I have considered perl and that seems to be the best alternative now. Vb.net was slower in testing. C++ may be better also. Any advice would be greatly appreciated.
You need a compiled language like Fortran, C, or C++. Other languages are designed to give you flexibility, object-orientation, or other advantages, and assume absolutely fastest performance is not your highest priority.
Know how to get maximum performance out of a single thread, and after you have done so investigate sharing the work across multiple cores, for example with MPI. To get maximum performance in a single thread, one thing I do is single-step it at the machine instruction level, to make sure it's not dawdling about in stuff that could be removed.
Some calculations are regular enough to take profit of GPGPUs: recent graphic cards are essentially specialized massively parallel numerical co-processors. For instance, you could code your numerical kernels in OpenCL. Otherwise, learn C++11 (not some earlier version of the C++ standard) or C. And in many cases Ocaml could be nearly as fast as C++ but much easier to code with.
Perhaps your problem can be handled by scilab or R, I did not understand it enough to help more.
And you might take advantage of your multi-core processor by e.g. using Pthreads or MPI
At last, the Linux operating system is perhaps better to deal with massive calculations. It is significant that most super computers use it today.
If execution speed is the highest priority, that usually means Fortran.
Try Julia: its killing feature is being easy to code in a high level concise way, while keeping performances at the same order of magnitude of Fortran/C.
PARI/GP is the best I have used so far. It's written in C.
Try to look at DMelt mathematical program. The program calls Java libraries. Java virtual machine can optimize long mathematical calculations for you.
The standard tool for mathmatic numerical operations in engineering is often Matlab (or as free alternatives octave or the already mentioned scilab).

Efficient EigenSolver Implementation

I am looking for an efficient eigensolver ( language not important, although I would be programming in C#), that utilizes the multi-core features found in modern CPU. Being able to work directly with pardiso solver is a major plus. My matrix are mostly sparse matrix, so an ideal solver should be able to take advantage of this fact and greatly enhance the memory usage and performance.
So far I have only found LAPACK and ARPACK. The LAPACK, as implemented in Intel MKL, is a good candidate, as it offers multi-core optimization. But it seems that the drivers inside the LAPACK don't work directly with pardiso solver, furthermore, it seems that they don't take advantage of sparse matrix ( but I am not sure on this point).
ARPACK, on the other hand, seems to be pretty hard to setup in Windows environment, and the parallel version, PARPACK, doesn't work so well. The bonus point is that it can work with pardiso solver.
The best would be Intel MKL + ARPACK with multi-core speedup. Not sure whether there is any existing implementations that already do what I want to do?
I'm working on a problem with needs very similar to the ones you state. I'm considering FEAST:
http://www.ecs.umass.edu/~polizzi/feast/index.htm
I'm trying to make it work right now, but it seems perfect. I'm interested in hearing what your experience with it is, if you use it.
cheers
Ned
Have a look at the Eigen2 library.
I've implemented it already, in C#.
The idea is that one must convert the matrix format in CSR format. Then, one can use MKL to compute linear equation solving algorithm ( using pardiso solver), the matrix-vector manipulation.

Resources