parallel matrix multiplication using PBLAS - parallel-processing

I have searched a lot of websites and resources but couldn't find any C or FORTRAN code example of parallel matrix multiplication using PBLAS PDGEMM function, could you please help me to find such resources.
Thank you in advance.
I have got an example of pblas.tar.gz from netlib website, did the make and executed it on Linux cluster using mpi but the program is executing the same run on all nodes without splitting the matrices.

A classic case would be the ScaLAPACK software and associated examples, such as http://www.netlib.org/scalapack/examples/example1.f
In case you misunderstood, PDGEMM will not "split the matrices", it expects the input data to already be distributed properly (i.e., 2D block-cyclic distribution).

Related

OpenMDAO: Use Parallel Groups in order to run same computations i parallel, to make the optimization faster

I am writing this post in the hope to understand how to set parallel computations by using OpenMDAO. I have been told that OpenMDAO has an option called Parallel Groups (https://openmdao.org/newdocs/versions/latest/features/core_features/working_with_groups/parallel_group.html) and I am wondering if this option could help me to make a gradient-free optimizer able to run in parallel the computations of the function that it has to study.
Do you know if I can create 2 or 3 instances of the function that I am trying to optimize, and in that way make OpenMDAO able to run the instances of the function with differents chosen inputs, in order to find the optimal results with less time that if it had to work with only one function instance ?
I saw that this thread was closer to what I am trying to do: parallelize openmdao optimization with different initial guesses I think it could have brought me some answers, but it appears that the link proposed as an answer is not available anymore.
Many thanks in advance for your help
to start, You'll need to install MPI, MPI4py, PETSc, and PETSc4py. These can be installed without too much effort on linux. They are a little harder on OSx, and VERY hard on windows.
You can use parallel groups to run multiple components in parallel. Whether or not you can make use of that for a gradient free method though is a trickier question. Unfortunately, as of V3.17, none of the current gradient free drivers are set up to work that way.
You could very probably make it work, but it will require some development on your part. You'll need to find a way to map the "generation data" (using that GA term as a generic reference to the set of parallel cases you can run at once for a gradient free method). That will almost certainly involve setting up a for loop outside the normal OpenMDAO run method.
You would set up the model with n instances in parallel where n is equal to the size of a generation. Then write your own code around a call to run_model that would map the gradient free data down into that model to run the cases all at once.
I am essentially proposing that you forgo the driver API and write your own execution code around OpenMDAO. This modeling approach was prototyped in the 2020 Reverse Hackathon, where we discussed how the driver API is not strictly necessary.

Julia package for managing distributed SpGEMM

Does anyone know a performant package under Juia to compute sparse matrix-matrix multuplication (SpGEMM) on a distributed cluster (MPI)?
I'm not sure if Elemental.jl is able to manage such computations.
I'm looking for something simple (such as COSMA.jl for dense systems), all help would be welcome...
Thanks
Elemental does appear to be able to handle this. In particular, using Elemental.jl you should be able to create a sparse distributed array with Elemental.DistSparseMatrix, which you should AFAICT be able to multiply with mul! or similar.
This does not not appear to be extensively documented, and in particular filling this DistSparseMatrix with the desired values does not appear to be trivial, but some examples appear in https://github.com/JuliaParallel/Elemental.jl/blob/master/test/lav.jl, among a few other places in the package source
Beyond this, while there are of course packages such as DistributedArrays.jl and the SparseArrays stdlib, there is not to my knowledge any sparse, distributed array package in pure Julia yet, so a wrapper package like Elemental.jl is going to be your best bet.
Other packages that should be generically capable of sparse distributed matrix multiplication appear to include PETSc and Trilinos, both of which do have Julia wrappers (the latter of which appears unmaintained, though see also the presentation thereon). With PETSc.jl, it appears that you should be able to make a "MATSEQ" sparse matrix by passing a Julia SparseMatrixCSC to PETSc.Mat.

MPI and message passing in Julia

I never used MPI before and now for my project in Julia I need to learn how to write my code in MPI and have several codes with different parameters run in parallel and from time to time send some data from each calculation to the other ones.
And I am absolutely blank how to do this in Julia and I never did it in any language before. I installed library MPI but didn't find good tutorial or documentation or an available example for that.
There are different ways to do parallel programming with Julia.
If your problem is very simply, then it might sufficient to use parallel for loops and shared arrays:
https://docs.julialang.org/en/v1/manual/parallel-computing/
Note however, you cannot use multiple computing nodes (such as a cluster) in this case.
To me, the other native constructs in Julia are difficult to work with for more complex programs and in my case, I needed to restructure (significantly) my serial code to use them.
The advantage of MPI is that you will find a lot of documentation of doing MPI-style (single-program, multiple-data) programming in general (but not necessarily documentation specific to julia). You might find the MPI style also more obvious.
On a large cluster it is also possible that you will find optimized MPI libraries.
A good starting points are the examples distributed with MPI.jl:
https://github.com/JuliaParallel/MPI.jl/tree/master/examples

ode45 mex file runs slow

This is the function trial2 which creates the system of differential equations:
function xdot=trial2(t,x)
delta=0.1045;epsilon=0.0048685;
xdot=[x(2);(-delta-epsilon*cos(t))*x(1)-0.7*delta*abs(x(1))];
Then, I try and solve this using ode45:
[t,x]=ode45('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But, this takes approximately 15 minutes. So I tried to improve the run time by creating a mex file. I found an ode45 mex equivalent function ode45eq.c. Then, I tried to solve by:
mex ode45eq.c;
[t,x]=ode45eq('trial2',[0 10000000],[0;1]);
plot(t,x(:,1),'r');
But this seems to run even slower. What could be the reason and how can I improve the run time to 2–3 minutes? I have to perform a lot of these computations. Also, is it worth creating a standalone C++ file to solve the problem faster?
Also, I am running this on an Intel i5 processor 64-bit system with 8 gb RAM. How much gain in speed do you think I can get if I move to a better processor with say 16 gb RAM?
With current versions of Matlab, it's very unlikely that you'll see any performance improvement by compiling ode45 to C/C++ mex. In fact, the compiled version will almost certainly be slower as you found. There's a good reason that ode45 is written in pure Matlab, as opposed to being compiled down to a native C function: it has to call user functions written in Matlab on every iteration. Additionally, Matlab's ode45 is a very dynamic function that is capable of interacting with the Matlab environment in many ways during the course of integration (plotting output functions, event detection, interpolation, etc.). It's also probably more straightforward to safely handle dynamic memory allocation in Matlab, than in C.
Your C code calls your user function via mexCallMATLAB. This function is not really meant for repeated calls, especially if they transfer data back and forth. Doing what you're trying to do would likely require new mex APIs and possibly changes to the Matlab language.
If you want faster numerical integration, you're going to have to give up the convenience of being able to write your integrations functions (i.e., trial2 in your example) in Matlab. You'll need to "hard code" your integration functions and compile them along with the integration scheme itself. With a detailed knowledge about your problem, and decent programming skills, you can write a tight integration loop and it may be possible to achieve an order of magnitude speedup in some cases.
Lastly, your trial2 function has an absolute value as well as an oscillating trigonometric function in it. Is this differential equation stiff? Have you tried other solvers, e.g., ode15s? Compare the outputs even over a shorter time period. And you may find that you get a bit of a speed-up (~25% on my machine) if you use the modern way of passing function handles instead of strings to ode45:
[t,x] = ode45(#trial2,[0 10000000],[0;1]);
The trial2 function can still be in a separate M-file or it can be a sub-function in the same file with your call to ode45 (this file need to be a function file, not a script of course).
What you did is basically replacing the todays implementation with the implementation of 1993, you have shown that mathworks did a great job improving the performance of the ode45 solver.
I don't see any possibility to improve the performance in this case, you can assume that such a fundamental part of MATLAB like ode45 is implemented in a optimal way, replacing it with some other code to mex it is not the solution. Everything you could gain using a mex function is cutting of the overhead of input/output handling which is implemented in m. Probably less than 0.1s execution time.

Integrating ODEs on the GPU using boost and python

I posted here not too long ago about a model I am trying to build using pycuda which solves About 9000 coupled ODEs. My model is too slow however and an SO member suggested that memory transfers from host to GPU is probably the culprit.
Right now cuda is being used only to calculate the rate of change of each of the 9000 species I am dealing with. Since I am passing in an array from the host to the GPU to perform this calculation and returning an array from the GPU to integrate on the host I can see how this would slow things down.
Would boost be the solution to my problem? From what I read, boost allows interoperability between c++ and python. It also includes c++ odeint , which I read, partnered with thrust allows quick reduction and integration all on the GPU. Is my understanding correct?
Thank you,
Karsten
Yes, boost.odeint and boost.python should solve your problem. You can use odeint with Thrust. There are also some OpenCL libraries (VexCL, ViennaCL) which might be easier to use then Thrust. Have a look at thist paper for a comparions and for use cases of odeint on GPUs.
Boost.python can do the communication between the C++ application and Python. Another approach would be a very slim command line application for solving the ODE (using boost.odeint) and which is entirely controlled by your python application.

Resources