Using code with underlying parallelism in OpenMDAO - parallel-processing

I am interested in adding an underlying parallelism to a couple of our OpenMDAO components. The bulk of the code in these components is written in Fortran. The Fortran code is wrapped in python and then used as python modules in OpenMDAO. I would like to run these Fortran codes in parallel using either OpenMP or OpenMPI. We are already planning to use the built-in parallel functionality of OpenMDAO, so this would be a second layer of parallelism. Would this be feasible? If so, do you have a recommended approach that would work well with OpenMDAO?

First I'll address the question about OpenMP. Currently OpenMDAO doesn't use OpenMP itself, and we don't have any plans to change that any time soon. That means the framework doesn't know or care whether you happen to use OpenMP in your Fortran code. Feel free to do so, with all the normal caveats about mixed MPI + OpenMP codes in effect, of course!
If you would like to use MPI parallelism within a component itself, that is directly supported by OpenMDAO. We have a fairly simple tutorial for this situation, where the component itself asks for multiple processors. One notable feature of that tutorial is where the component asks the framework for more than one processor:
def get_req_procs(self):
    """
    min/max number of CPUs that this component can use
    """
    return (1, self.size)
In this case, the component will accept anywhere from 1 proc up to the number of elements in its array. In your case, you might want to restrict that to a single value, in which case you can return a single integer.
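For example (a minimal sketch; the count of 4 is just a placeholder), a component that needs a fixed number of processors could request the same value for the minimum and the maximum:

def get_req_procs(self):
    # request exactly 4 procs (placeholder count); the (min, max) tuple
    # form expresses the same constraint as returning a single value
    return (4, 4)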
The other notable part is:
def setup_distrib_idxs(self):
    """
    specify the local sizes of the variables and which specific indices this
    specific distributed component will handle. Indices do NOT need to be
    sequential or contiguous!
    """
    comm = self.comm
    rank = comm.rank

    # NOTE: evenly_distrib_idxs is a helper function to split the array up as evenly as possible
    sizes, offsets = evenly_distrib_idxs(comm.size, self.size)
    local_size, local_offset = sizes[rank], offsets[rank]
    self.local_size = int(local_size)

    start = local_offset
    end = local_offset + local_size

    self.set_var_indices('x', val=np.zeros(local_size, float),
                         src_indices=np.arange(start, end, dtype=int))
    self.set_var_indices('y', val=np.zeros(local_size, float),
                         src_indices=np.arange(start, end, dtype=int))
This code tells the framework how your distributed data gets split up across the many procs. The details of this method will vary heavily from one implementation to the next. In some cases, every proc might hold all of the data. In others (like this one) you'll distribute the data evenly across the procs. In still other cases you might have a combination of global and distributed data.
If you planned to use only OpenMP, you would probably share all the data amongst all the processes, but still request more than one proc. That way you ensure that OpenMDAO allocates enough procs to your component for it to be useful in a multi-threaded context. You'll be handed a comm object that you can use to divvy up the tasks.
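As a rough sketch of that OpenMP-only pattern (using the same tutorial-era API shown above; n_threads and my_fortran_module.compute are hypothetical names for your desired thread count and your wrapped Fortran routine):

def get_req_procs(self):
    # ask for as many procs as OpenMP threads the Fortran code will spawn,
    # so the framework sets enough cores aside for this component
    return (self.n_threads, self.n_threads)

def solve_nonlinear(self, params, unknowns, resids):
    comm = self.comm
    if comm.rank == 0:
        # rank 0 runs the OpenMP-threaded Fortran routine on the full data
        y = my_fortran_module.compute(params['x'], comm.size)
    else:
        y = None
    # every proc ends up holding the full (shared, non-distributed) result
    unknowns['y'] = comm.bcast(y, root=0)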
If you planned to use purely MPI, it's likely (though not certain) that you'll be working with distributed data. You'll still want to request more than one proc, but you'll also have to split up the data.
If you decide to use OpenMP and MPI, then likely some combination of distributed and shared data will be needed.

Related

OpenMDAO: Use Parallel Groups in order to run the same computations in parallel, to make the optimization faster

I am writing this post in the hope of understanding how to set up parallel computations using OpenMDAO. I have been told that OpenMDAO has an option called Parallel Groups (https://openmdao.org/newdocs/versions/latest/features/core_features/working_with_groups/parallel_group.html) and I am wondering if this option could help me make a gradient-free optimizer run the computations of the function it has to study in parallel.
Do you know if I can create 2 or 3 instances of the function that I am trying to optimize, so that OpenMDAO can run those instances with different chosen inputs and find the optimal result in less time than if it had to work with only one function instance?
I saw that this thread was closer to what I am trying to do: parallelize openmdao optimization with different initial guesses. I think it could have brought me some answers, but it appears that the link proposed as an answer is no longer available.
Many thanks in advance for your help
To start, you'll need to install MPI, mpi4py, PETSc, and petsc4py. These can be installed without too much effort on Linux. They are a little harder on macOS, and very hard on Windows.
You can use parallel groups to run multiple components in parallel. Whether or not you can make use of that for a gradient-free method, though, is a trickier question. Unfortunately, as of V3.17, none of the current gradient-free drivers are set up to work that way.
You could very probably make it work, but it will require some development on your part. You'll need to find a way to map the "generation data" (using that GA term as a generic reference to the set of parallel cases you can run at once with a gradient-free method) onto the parallel instances. That will almost certainly involve setting up a for loop outside the normal OpenMDAO run method.
You would set up the model with n instances in parallel, where n is equal to the size of a generation. Then write your own code around a call to run_model that maps the gradient-free data down into that model to run the cases all at once.
I am essentially proposing that you forgo the driver API and write your own execution code around OpenMDAO. This modeling approach was prototyped in the 2020 Reverse Hackathon, where we discussed how the driver API is not strictly necessary.
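A minimal sketch of what that hand-rolled loop could look like (the ExecComp, the generation size of 4, and the candidate inputs are all placeholders for your actual function and gradient-free logic):

import openmdao.api as om

N = 4  # hypothetical generation size

prob = om.Problem()
par = prob.model.add_subsystem('par', om.ParallelGroup())
for i in range(N):
    # stand-in for one instance of the function being optimized
    par.add_subsystem(f'case_{i}', om.ExecComp('f = (x - 3.0)**2'))
prob.setup()

# hand-rolled outer loop instead of a driver: each "generation" of candidate
# inputs is mapped down onto the N parallel instances and run all at once
for generation in ([0.0, 1.0, 2.0, 4.0], [2.5, 3.0, 3.5, 4.5]):
    for i, x in enumerate(generation):
        prob.set_val(f'par.case_{i}.x', x)
    prob.run_model()
    # under MPI you may need get_val(..., get_remote=True) to gather remote values
    results = [prob.get_val(f'par.case_{i}.f')[0] for i in range(N)]
    print(results)  # feed these back into the gradient-free selection logic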

MPI and message passing in Julia

I have never used MPI before, and now for my project in Julia I need to learn how to write my code with MPI: several codes with different parameters should run in parallel and from time to time send some data from each calculation to the others.
I am absolutely blank on how to do this in Julia, and I have never done it in any language before. I installed the MPI library but didn't find a good tutorial, documentation, or an available example for it.
There are different ways to do parallel programming with Julia.
If your problem is very simple, then it might be sufficient to use parallel for loops and shared arrays:
https://docs.julialang.org/en/v1/manual/parallel-computing/
Note, however, that you cannot use multiple compute nodes (such as a cluster) in this case.
To me, the other native constructs in Julia are difficult to work with for more complex programs, and in my case I needed to restructure my serial code significantly to use them.
The advantage of MPI is that you will find a lot of documentation on MPI-style (single-program, multiple-data) programming in general (though not necessarily documentation specific to Julia). You might also find the MPI style more intuitive.
On a large cluster it is also possible that you will find optimized MPI libraries.
A good starting point is the set of examples distributed with MPI.jl:
https://github.com/JuliaParallel/MPI.jl/tree/master/examples

Parallel for loop in Python

I am working on parallelization in Python, and I have a big calculation that needs to be run in parallel.
At first I had one big for loop (over 1000 particles, for example), so my process was not independent, and I need independent processes to make it parallel. So I divided the for loop into two for loops that calculate 500 and 500 particles, and I want to run these two independent loops in parallel on two different cores (processors). Is that possible?
If yes, then how? Please share some guidance.
# first loop
for i in particles1:
    # some processing
    ...
print(something)

# second loop
for i in particles2:
    # some processing
    ...
print(something1)

And now I want to combine the results of these two processes, so:

print(something + something1)

This is exactly what I want to do. Please share your ideas.
There are two possibilities -- multithreading and multiprocessing. Multithreading is more convenient, because it allows you to automatically share the global state among all your worker threads, rather than explicitly transferring it around. But in Python, something called the "global interpreter lock" (GIL) makes it difficult for multiple threads to actually work in parallel; they tend to get blocked behind each other, with only one thread actually doing work at a time. Multiprocessing takes more setup, and you have to explicitly transfer data around (which can be expensive), but it's more effective at actually using multiple processors.
Both multithreading and multiprocessing can be leveraged by the concurrent.futures module, which gives you task-based parallelism. That might be tricky to get your head around, but once you do, it's the most efficient way to write multiprocessor-capable code.
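For instance, here is a minimal sketch of your two-loop example using concurrent.futures with processes (process_particles and the squaring are placeholders for your real per-particle work):

from concurrent.futures import ProcessPoolExecutor

def process_particles(particles):
    # placeholder for the real per-particle computation
    return sum(p * p for p in particles)

if __name__ == '__main__':
    particles1 = range(500)        # first half of the particles
    particles2 = range(500, 1000)  # second half

    # run the two loops in two separate processes, on two cores
    with ProcessPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(process_particles, particles1)
        future2 = pool.submit(process_particles, particles2)
        something, something1 = future1.result(), future2.result()

    print(something + something1)  # combine the two partial results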
Finally, if your for-loop is doing the same math operations on a bunch of data, you should look into numpy, and vectorizing your data. That's the most difficult approach, but it will also give you the best performance.
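As a sketch of that vectorized alternative (again with a made-up computation standing in for yours):

import numpy as np

particles = np.arange(1000, dtype=float)  # all 1000 particles at once
result = np.sum(particles ** 2)           # the whole loop as one array operation
print(result)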

When not to use MPI

This is not a question about a specific technical coding aspect of MPI. I am new to MPI and don't want to make a fool of myself by using the library the wrong way, hence posting the question here.
As far as I understand, MPI is an environment for building parallel applications on a distributed-memory model.
I have a system that's interconnected with InfiniBand, for the sole purpose of doing some very time-consuming operations. I've already broken the algorithm out so it runs in parallel, so I am really only using MPI to transmit data (results of the intermediate steps) between multiple nodes over InfiniBand, which I believe one could simply use OpenIB to do.
Am I using MPI the right way? Or am I bending the original intention of the system?
It's fine to use just MPI_Send and MPI_Recv in your algorithm. As your algorithm evolves and you gain more experience, you may find uses for the more "advanced" MPI features such as barriers and collective communication like Gather, Reduce, etc.
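For illustration (in Python with mpi4py, since the rest of this page's examples are Python; the partial-sum workload is made up), here is the same accumulation done first with point-to-point calls and then with a collective:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
partial = float(rank + 1)  # stand-in for each node's intermediate result

# point-to-point version: every rank sends to rank 0, which accumulates
if rank == 0:
    total = partial
    for src in range(1, comm.Get_size()):
        total += comm.recv(source=src, tag=0)
    print('send/recv total:', total)
else:
    comm.send(partial, dest=0, tag=0)

# collective version: one Reduce call does the same accumulation
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print('reduce total:', total)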
The fewer and simpler the MPI constructs you need to use to get your work done, the better MPI is a match to your problem -- you can say that about most libraries and languages, as a practical matter and arguably as a matter of abstraction.
Yes, you could write raw OpenIB calls to do your work too, but what happens when you need to move to an Ethernet cluster, a huge shared-memory machine, or whatever the next big interconnect is? MPI is middleware, and as such, one of its big selling points is that you don't have to spend time writing network-level code.
At the other end of the complexity spectrum, the time not to use MPI is when your problem or solution technique presents enough dynamism that MPI usage (most specifically, its process model) is a hindrance. A system like Charm++ (disclosure: I'm a developer of Charm++) lets you do problem decomposition in terms of finer grained units, and its runtime system manages the distribution of those units to processors to ensure load balance, and keeps track of where they are to direct communication appropriately.
Another not-uncommon issue is dynamic data access patterns, where something like Global Arrays or a PGAS language would be much easier to code.

Distributed array in MPI for parallel numerics

In many distributed computing applications, you maintain a distributed array of objects. Each process manages a set of objects that it may read and write exclusively, and furthermore a set of objects that it may only read (the content of which is authored by, and frequently received from, other processes).
This is very basic and is likely to have been done a zillion times by now - for example, with MPI. Hence I suppose there is something like an open-source extension for MPI which provides the basic capabilities of a distributed array for computing.
Ideally, it would be written in C(++) and mimic the official MPI standard interface style. Does anybody know of anything like that? Thank you.
From what I gather from your question, you're looking for a mechanism for allowing a global view (read-only) of the problem space, but each process has ownership (read-write) of a segment of the data.
MPI is simply an API specification for inter-process communication for parallel applications and any implementation of it will work at a level lower than what you are looking for.
It is quite common in HPC applications to perform data decomposition in the way you mentioned, with MPI used to synchronise shared data between processes. However, each application has different sharing patterns and requirements (some may wish to exchange only halo regions with neighbouring nodes, perhaps using non-blocking calls to overlap communication with computation) so as to improve performance by making use of knowledge of the problem domain.
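As a small sketch of that halo-exchange pattern (shown in Python with mpi4py for consistency with the rest of the page; the 1-D decomposition and array size are made up):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 8                    # hypothetical interior cells per process
local = np.zeros(n_local + 2)  # interior plus one halo cell on each side
local[1:-1] = rank             # fill the interior with something

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# post non-blocking sends of the edge cells and receives into the halos
reqs = [comm.Isend(local[1:2], dest=left, tag=0),     # my left edge -> left neighbour
        comm.Isend(local[-2:-1], dest=right, tag=1),  # my right edge -> right neighbour
        comm.Irecv(local[0:1], source=left, tag=1),   # left halo <- left neighbour
        comm.Irecv(local[-1:], source=right, tag=0)]  # right halo <- right neighbour

# ... compute on the interior cells here to overlap with communication ...
MPI.Request.Waitall(reqs)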
The thing is, using MPI to sync data across processes is simple, but implementing a layer above it to handle general-purpose distributed array synchronisation that is easy to use yet flexible enough to handle different use cases can be rather tricky.
Apologies for taking so long to get to the point, but to answer your question: AFAIK there isn't an extension to MPI or a library that can efficiently handle all use cases while still being easier to use than MPI itself. However, it is possible to work above the level of MPI while maintaining distributed data. For example:
Use the PGAS model to work with your data. You can then use libraries such as Global Arrays (interfaces for C, C++, Fortran, Python) or languages that support PGAS such as UPC or Co-Array Fortran (soon to be included in the Fortran standard). There are also languages designed specifically for this form of parallelism, i.e. Fortress, Chapel, X10.
Roll your own. For example, I've worked on a library that uses MPI to do all the dirty work but hides the complexity by providing custom data types for the application domain, and exposing APIs such as:
X_Create(MODE, t_X) : instantiate the array, called by all processes with the MODE indicating if the current process will require READ-WRITE or READ-ONLY access
X_Sync_start(t_X) : non-blocking call to initiate synchronisation in the background.
X_Sync_complete(t_X) : data is required. Block if synchronisation has not completed.
... and other calls to delete data as well as perform domain specific tasks that may require MPI calls.
To be honest, in most cases it is often simpler to stick with basic MPI or OpenMP, or if one exists, using a parallel solver written for the application domain. This of course depends on your requirements.
For dense arrays, see Global Arrays and Elemental (Google will find them for you).
For sparse arrays, see PETSc.
I know this is a really short answer, but there is too much documentation of these elsewhere to bother repeating it.
