I am working on parallelization in Python, and I have a big calculation that needs to run in parallel.
At first I had one big for loop (over 1000 particles, for example), so the iterations were not independent, and I need independent work in order to parallelize it. So I split the loop into two for loops that calculate 500 particles each, and I want to run these two independent loops in parallel on two different cores (processors). Is that possible?
If yes, then how? Please share some guidance.
# first loop
for i in particles1:
    # some processing
    ...
print(something)

# second loop
for i in particles2:
    # some processing
    ...
print(something1)

And now I want to combine these two results, so:

print(something + something1)

This is exactly what I want to do. Please share your ideas.
There are two possibilities -- multithreading and multiprocessing. Multithreading is more convenient, because it allows you to automatically share global state among all your worker threads, rather than explicitly transferring it around. But in Python, something called the "global interpreter lock" (GIL) makes it difficult for multiple threads to actually work in parallel; they tend to get blocked behind each other, with only one thread doing work at a time. Multiprocessing takes more setup, and you have to explicitly transfer data around (which can be expensive), but it's more effective at actually using multiple processors.
Both multithreading and multiprocessing can be leveraged by the concurrent.futures module, which gives you task-based parallelism. That might be tricky to get your head around, but once you do, it's the most efficient way to write multiprocessor-capable code.
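As a rough sketch of the multiprocessing route with concurrent.futures (process_particles and the particle ranges below are placeholders for your own calculation):

import concurrent.futures

def process_particles(particles):
    # stand-in for whatever per-particle work you actually do
    total = 0
    for p in particles:
        total += p * p
    return total

if __name__ == "__main__":
    particles1 = range(500)          # first half of the particles
    particles2 = range(500, 1000)    # second half of the particles

    # run the two halves in two separate worker processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        future1 = executor.submit(process_particles, particles1)
        future2 = executor.submit(process_particles, particles2)
        something = future1.result()
        something1 = future2.result()

    print(something + something1)

The if __name__ == "__main__" guard matters here: on platforms that spawn new interpreters for worker processes (e.g. Windows), the pool must only be created from the main module.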
Finally, if your for-loop is doing the same math operations on a bunch of data, you should look into numpy, and vectorizing your data. That's the most difficult approach, but it will also give you the best performance.
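For example, if the per-particle work were just arithmetic, a vectorized version might look like this (the data and the formula are made up for illustration):

import numpy as np

positions = np.random.rand(1000, 3)   # 1000 particles, 3 coordinates each (made-up data)

# one vectorized expression instead of a Python-level loop over particles
distances = np.sqrt((positions ** 2).sum(axis=1))
print(distances.sum())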
I am writing this post in the hope of understanding how to set up parallel computations using OpenMDAO. I have been told that OpenMDAO has a feature called Parallel Groups (https://openmdao.org/newdocs/versions/latest/features/core_features/working_with_groups/parallel_group.html), and I am wondering whether this feature could help me make a gradient-free optimizer run the evaluations of the function it is studying in parallel.
Do you know if I can create 2 or 3 instances of the function that I am trying to optimize, so that OpenMDAO can run those instances with different chosen inputs, in order to find the optimal result in less time than if it worked with only one function instance?
I saw that this thread was closest to what I am trying to do: parallelize openmdao optimization with different initial guesses. I think it could have answered some of my questions, but the link proposed as an answer is no longer available.
Many thanks in advance for your help.
To start, you'll need to install MPI, mpi4py, PETSc, and petsc4py. These can be installed without too much effort on Linux. They are a little harder on OS X, and very hard on Windows.
You can use parallel groups to run multiple components in parallel. Whether or not you can make use of that for a gradient free method though is a trickier question. Unfortunately, as of V3.17, none of the current gradient free drivers are set up to work that way.
You could very probably make it work, but it will require some development on your part. You'll need to find a way to map the "generation data" (using that GA term as a generic reference to the set of parallel cases you can run at once in a gradient-free method) onto the parallel instances. That will almost certainly involve setting up a for loop outside the normal OpenMDAO run method.
You would set up the model with n instances in parallel where n is equal to the size of a generation. Then write your own code around a call to run_model that would map the gradient free data down into that model to run the cases all at once.
I am essentially proposing that you forgo the driver API and write your own execution code around OpenMDAO. This modeling approach was prototyped in the 2020 Reverse Hackathon, where we discussed how the driver API is not strictly necessary.
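A minimal sketch of that structure, assuming a recent OpenMDAO 3.x API and a toy ExecComp standing in for your real function (the component names, the expression, and the candidate inputs are all placeholders):

import openmdao.api as om

N = 3  # size of a "generation": how many cases to evaluate at once

prob = om.Problem()
par = prob.model.add_subsystem('par', om.ParallelGroup())
for i in range(N):
    # each instance is a stand-in for the function you want to optimize
    par.add_subsystem(f'case_{i}', om.ExecComp('f = (x - 3.0)**2 + x*y + (y + 4.0)**2 - 3.0'))

prob.setup()

# your own "driver": map a generation of candidate inputs onto the parallel
# instances, run them all at once, then harvest the results
generation = [(0.0, 0.0), (1.0, 2.0), (-1.0, 3.0)]
for i, (x, y) in enumerate(generation):
    prob.set_val(f'par.case_{i}.x', x)
    prob.set_val(f'par.case_{i}.y', y)

prob.run_model()

results = [prob.get_val(f'par.case_{i}.f') for i in range(N)]
print(results)

Run without MPI this just evaluates the cases one after another; launched under MPI (e.g. mpirun -n 3), the ParallelGroup spreads them across processes, and you may need prob.get_val(..., get_remote=True) to collect values owned by other ranks.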
I am interested in adding an underlying parallelism to a couple of our OpenMDAO components. The bulk of the code in these components is written in Fortran. The Fortran code is wrapped in python and then used as python modules in OpenMDAO. I would like to run these Fortran codes in parallel using either OpenMP or OpenMPI. We are already planning to use the built-in parallel functionality of OpenMDAO, so this would be a second layer of parallelism. Would this be feasible? If so, do you have a recommended approach that would work well with OpenMDAO?
First I'll address the question about OpenMP. Currently OpenMDAO doesn't use OpenMP itself, and we don't have any plans to change that any time soon. That means the framework doesn't really know or care whether you happen to use it in your Fortran code at all. Feel free, with all the normal caveats about MPI + OpenMP codes in effect, of course!
If you would like to use MPI parallelism in a component itself, that is directly supported by OpenMDAO. We have a fairly simple tutorial up for this situation, where the component itself asks for multiple processors. The first notable feature of this tutorial is where the component asks the framework for more than one processor:
def get_req_procs(self):
    """
    min/max number of CPUs that this component can use
    """
    return (1, self.size)
In this case, the component will accept anywhere from 1 proc, up to the number of elements in its array. In your case, you might want to restrict that to a single value, in which case you can return a single integer.
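For example, a hypothetical component that always wants exactly four processors (using the single-integer form mentioned above) could do:

def get_req_procs(self):
    # this component asks for exactly 4 processors (value chosen for illustration)
    return 4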
The other notable part is:
def setup_distrib_idxs(self):
    """
    specify the local sizes of the variables and which specific indices this specific
    distributed component will handle. Indices do NOT need to be sequential or
    contiguous!
    """
    comm = self.comm
    rank = comm.rank

    # NOTE: evenly_distrib_idxs is a helper function to split the array up as evenly as possible
    sizes, offsets = evenly_distrib_idxs(comm.size, self.size)
    local_size, local_offset = sizes[rank], offsets[rank]
    self.local_size = int(local_size)

    start = local_offset
    end = local_offset + local_size

    self.set_var_indices('x', val=np.zeros(local_size, float),
                         src_indices=np.arange(start, end, dtype=int))
    self.set_var_indices('y', val=np.zeros(local_size, float),
                         src_indices=np.arange(start, end, dtype=int))
This code tells the framework how your distributed data gets split up across the many procs. The details of this method are going to vary heavily from one implementation to the next. In some cases, you might have all procs hold all the data. In others (like this one) you'll distribute the data evenly across the procs. In still other cases you might have a combination of global and distributed data.
If you planned to use only OpenMP, you would probably share all the data among all the processes, but still request more than 1 proc. That way you ensure that OpenMDAO allocates enough procs to your comp that it can be useful in a multi-threaded context. You'll be handed a comm object that you can work with to divvy up the tasks.
If you planned to use purely MPI, it's likely (though not certain) that you'll be working with distributed data. You'll still want to request more than 1 proc, but you'll also have to split up the data.
If you decide to use OpenMP and MPI, then likely some combination of distributed and shared data will be needed.
I'm looking for some canonical, simple concurrency problems, suitable for demonstrating usage of a library for concurrent computations I'm working on.
To clarify what I mean by "concurrency": I'm interested in algorithms that utilize non-deterministic communicating processes, not in e.g. making algorithms like quicksort run faster by spreading the work over multiple processors. This is how I'm using the term.
I know about the Dining Philosophers Problem, and that would be acceptable, but I wonder whether there are any more convincing but equally simple problems.
Producer-Consumer problem.
I generally use a simple "bank account transfer" scenario. For example I posted one such trivial case in this question on transactions.
It's a good case for exposition because:
Everyone understands the business problem.
It emphasises the importance of transactions in a concurrent environment.
You can easily extend the scenario (e.g. what if you want to calculate the sum of all current account balances while transactions are taking place?)
To demonstrate your concurrency library, you could probably start a thread running millions of transactions in this kind of scenario, and demonstrate how other threads can still see a consistent view of the world etc.
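A bare-bones sketch of that scenario in Python (the account data and transfer counts are arbitrary, and a single lock stands in for a real transaction mechanism):

import random
import threading

accounts = [100] * 10          # ten accounts with 100 units each
lock = threading.Lock()

def transfer_randomly(n_transfers):
    for _ in range(n_transfers):
        a, b = random.sample(range(len(accounts)), 2)
        with lock:             # without this lock, concurrent updates can lose money
            amount = min(10, accounts[a])
            accounts[a] -= amount
            accounts[b] += amount

threads = [threading.Thread(target=transfer_randomly, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# the invariant to demonstrate: the total amount of money is conserved
print(sum(accounts))           # should always print 1000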
I don't think there is a standard first program for demonstrating that concurrency is working, like "Hello world" for sequential programs.
More typical for concurrency are programs that demonstrate problems, for example concurrent counters that lose some counts without proper synchronization. Or random transfers between bank accounts that cause a deadlock if locking is done naively. (I did these when playing with Java concurrency.)
One thing that demonstrates concurrency and is relatively simple is to count cooperatively: The concurrent threads (or whatever) have an internal counter, which they send out to each other, and set to what they receive plus one. (I did that with three LEGO Mindstorms RCX over infrared some years ago, worked nicely.)
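A Python sketch of that cooperative counting, with two threads passing the counter back and forth through queues (the names and the stopping value are arbitrary):

import queue
import threading

def counter(name, inbox, outbox, limit=10):
    while True:
        value = inbox.get()          # wait for the other side's counter
        if value >= limit:
            outbox.put(value)        # let the peer terminate too
            break
        print(name, value + 1)
        outbox.put(value + 1)        # send back "what I received plus one"

a_to_b, b_to_a = queue.Queue(), queue.Queue()
t1 = threading.Thread(target=counter, args=("left", b_to_a, a_to_b))
t2 = threading.Thread(target=counter, args=("right", a_to_b, b_to_a))
t1.start(); t2.start()
b_to_a.put(0)                        # kick things off
t1.join(); t2.join()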
BTW: The "Hello world" of the embedded programmer is the blinking LED.
There used to be a sample Java applet (quite possibly there still is) that you could use to test what scheduling algorithm your JVM and underlying OS use. It animated two (or optionally more? I can't remember) bars gradually filling up, each animated by a different thread at the same priority.
An equivalent that prints:
red 1
red 2
green 1
red 3
green 2
etc to the console, seems to me to be the closest thing in spirit to the bare bones nature of "hello, world". That is, "can I make the computer do something useless but visible?"
So in each thread you'd want a series of pauses (either busy-loops or sleeps, up to you, and which you choose might affect the output depending on how your concurrency is scheduled), each followed by some output. You might want to synchronize the output -- not really essential, but if a line were broken up by the scheduler it would be awkward to read.
Then if your concurrency model is co-operative (either neolithic threads, or perhaps something co-routine-based), you have to add suitable yields as well, to prevent the red bar filling before the green bar starts. That tells you that you've successfully made your concurrent code interleave.
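A minimal Python version of that "can I see the interleaving?" program might look like this (the sleep lengths and iteration count are arbitrary):

import threading
import time

print_lock = threading.Lock()        # keep each output line intact

def fill(colour, steps=5, delay=0.1):
    for i in range(1, steps + 1):
        time.sleep(delay)            # stand-in for a unit of work
        with print_lock:
            print(colour, i)

threads = [threading.Thread(target=fill, args=(c,)) for c in ("red", "green")]
for t in threads:
    t.start()
for t in threads:
    t.join()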
You can ray trace "Hello" and "World" in separate threads.
Or animate "Hello" while "World" is raytracing.
I'm looking at evolving ants capable of food foraging behaviour using genetic programming, as described by Koza here. Each time step, I loop through each ant, executing its computer program (the same program is used by all ants in the colony). Currently, I have defined simple instructions like MOVE-ONE-STEP, TURN-LEFT, TURN-RIGHT, etc. But I also have a function PROGN that executes arguments in sequence. The problem I am having is that because PROGN can execute instructions in sequence, it means an ant can do multiple actions in a single time step. Unlike nature, I cannot run the ants in parallel, meaning one ant might go and perform several actions, manipulating the environment whilst all of the other ants are waiting to have their turn.
I'm just wondering, is this how it is normally done, or is there a better way? Koza does not seem to mention anything about it. Thing is, I want to expand the scenario to have other agents (e.g. enemies), which might rely on things occurring only once in a single time step.
I am not familiar with Koza's work, but I think a reasonable approach is to give each ant its own instruction queue that persists across time steps. By doing this, you can get the ants to execute PROGN functions one instruction per time step. For instance, the high-level logic for an ant's time step could be sketched like this (in Python; ant.q, get_next_instructions, and execute stand for whatever your ant class provides):
def do_time_step(ant):
    # ant.q is assumed to be a FIFO queue (e.g. a collections.deque)
    if not ant.q:                              # queue is empty: refill it with the next instruction(s)
        for instruction in ant.get_next_instructions():
            ant.q.append(instruction)
    instruction = ant.q.popleft()              # take exactly one instruction this time step
    ant.execute(instruction)                   # have that ant do its job
Another, similar approach to queuing instructions would be to preprocess the set of instructions and expand instances of PROGN into their component instructions. This would have to be done recursively if you allow PROGNs to invoke other PROGNs. The downside is that the candidate programs get a bit bloated, but only at runtime. On the other hand, it is quick, simple, and pretty easy to debug.
Example:
Say PROGN1 = {inst-p1 inst-p2}
Then the candidate program would start off as {inst1 PROGN1 inst2} and would be expanded to {inst1 inst-p1 inst-p2 inst2} when it was ready to be evaluated in simulation.
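A small sketch of that expansion step (the list-of-strings program representation here is only for illustration; a real GP tree would need the same idea applied to its nodes):

def expand(program, definitions):
    # recursively flatten any PROGN names into their component instructions
    flat = []
    for inst in program:
        if inst in definitions:                  # inst names a PROGN, e.g. "PROGN1"
            flat.extend(expand(definitions[inst], definitions))
        else:
            flat.append(inst)
    return flat

definitions = {"PROGN1": ["inst-p1", "inst-p2"]}
print(expand(["inst1", "PROGN1", "inst2"], definitions))
# -> ['inst1', 'inst-p1', 'inst-p2', 'inst2']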
It all depends on your particular GP implementation.
In my GP kernel, programs are evaluated either repeatedly or in parallel -- as a whole, i.e. the "atomic" operation in this scenario is a single program evaluation.
So either each individual in the population is run n times sequentially before the next program is evaluated, or every individual is executed just once and that whole pass is then repeated n times.
I've had pretty nice results with virtual agents using this level of concurrency.
It is definitely possible to break it down even further; however, at that point you'll reduce the scalability of your algorithm:
While it is easy to distribute the evaluation of whole programs amongst several CPUs or cores, doing the same at per-node granularity will be next to worthless, simply because of the amount of synchronization required between all the programs.
Given the rapidly increasing number of CPUs/cores in modern systems (even smartphones) and the 'CPU-hunger' of GP you might want to rethink your approach - do you really want to include move/turn instructions in your programs?
Why not redesign it to use primitives that store away direction and speed parameters in some registers/variables during program evaluation?
The simulation step then takes these parameters to actually move/turn your agents based on the instructions stored away by the programs.
evaluate programs (in parallel)
execute simulation
repeat for n times
evaluate fitness, selection, ...
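A toy, self-contained Python sketch of that loop (the "programs", the world state, and the arithmetic are all placeholders; only the evaluate-in-parallel-then-step-sequentially shape is the point):

import random
from multiprocessing import Pool

def evaluate_program(args):
    # stand-in for running one GP program: it only *records* an intent
    # (e.g. a direction/speed), it does not touch the shared world
    program, position = args
    return (position + program) % 100

if __name__ == "__main__":
    population = [random.randint(-3, 3) for _ in range(8)]   # toy "programs"
    positions = [0] * len(population)                        # toy world state
    with Pool() as pool:
        for _ in range(5):                                   # n simulation steps
            # 1. evaluate all programs in parallel
            intents = pool.map(evaluate_program, list(zip(population, positions)))
            # 2. one sequential simulation step applies every stored intent at once
            positions = intents
    print(positions)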
Cheers,
Jay
I'm looking to make use of the advantages of parallel programming in LINQ by using PLINQ. I'm not sure I understand its use entirely, apart from the fact that it will use all CPU cores more efficiently, so a large query might run quicker. Can I just call AsParallel() on LINQ calls to make use of the PLINQ functionality, and will it always be quicker? Or should I only use it when there is a lot of data to query or process?
You can't just assume that execution in parallel is always faster. It depends. In some situations you will gain a lot on multi-core processors by doing things in parallel. In other cases, you will just slow things down, since parallel loops have a small overhead compared to simple loops.
For example, see my other answer which explains why embedded parallel loops can be a disaster.
Now, the best way to know whether a parallel loop is a good idea in a particular context is to test both the parallel and the non-parallel implementations and measure the time they take.
To further add to the answer, it also depends on your data. Going a little "old school" for a moment, you could go down the road of loop unrolling, using for instead of foreach, and so on.
However, you really need to ensure you aren't micro-optimising. Depending on your data fetches and the size of the data (certainly with paged data), you can probably get away with not using it.
That's not to say that making your LINQ multi-core aware isn't cool. But be aware of the setup costs of doing something like that, so you can weigh the benefits against the complexity of maintaining and debugging that code.
If your algorithm is already top notch, then looking at the PLINQ extensions, a map-reduce mechanism, or something similar may be the way to go. But first check your algorithm and your overall benefits. Operating on the right kind of collection (etc.) in the right kind of way will always bring its own benefits (and problems!).
What are you trying to solve?