I am new to parallelization and MPI. I am learning and experimenting with mpi4py. Currently I am trying to optimize the performance of a method which randomly samples for a desired point (one satisfying a condition) in an interval.
To give you a detailed idea, I created a sample program which is similar to my program. The aim of this program is to output 20 numbers between 9.9999 and 10.0. This is done by randomly sampling from [0.0, 1.0] and multiplying by 10 (a factor that shrinks by an infinitesimally small amount with each iteration).
The following is the function; comments are provided.
import numpy as np
import time

def fit(iterations, value):
    array = []
    sample = None
    # Runs for a fixed number of iterations. Each iteration contributes
    # one sample to the accumulator (in this case, the list `array`).
    for iteration in range(iterations):
        while True:
            arbit = np.random.uniform(0, 1)
            # The main condition for sampling. In my original program this
            # is bound to change with each iteration, so I made it depend
            # on the iteration in this test program.
            if (10 - 0.000001 * iteration) * arbit > value:
                sample = 10 * arbit
                break
        # The accumulation of the accepted sample. If there are n
        # iterations, n samples are expected.
        array.append(sample)
    print("Result: " + str(array))

if __name__ == '__main__':
    startTime = time.time()
    fit(20, 9.9999)
    elapsedTime = time.time() - startTime
    print("\ntime taken: " + str(elapsedTime) + "\n")
As you can see, all the action happens in the while loop in the fit() method. What I want to do is take advantage of parallelization and mpi4py to make this method faster. For example, I want to start n processes, and when the while loop is reached, the processes search in parallel; whichever one finds the desired value first, I want to take that value, add it to the accumulator array, and abort all other processes. Then I want to continue this routine in the next iteration, and so on, until the method finishes. Is something like this possible? If not this way, what other way can I use parallelization in the above function?
Thanks
The general ideas behind parallelization are heavily application-dependent. The more independent you can make your processes, the better. Inter-process communication adds hassle and overhead. This is especially true if your processes reside in different computers.
With your sample code above, the simple way to make it parallel would be to split it by iterations. You would have a list of iterations and a number of worker processes which would churn through one iteration cycle at a time. Even if you needed to have the results in order, you can sort them afterwards, so it does not really matter whether you go through iterations 0, 1, 2, 3... or e.g. 17, 3, 14, 1, 5...
But what you seem to suggest is that you split each iteration cycle into parallel loops looking for a suitable number. That is possible (but make sure you use different random seeds in different processes, otherwise they will replicate the same sequence), and the communication needed is very simple:
worker processes need to be able to send "I found it!"
worker processes need to stop when another process sends "I found it!"
worker processes need to be able to fetch a new starting value after they are done
There are several ways to accomplish this. The description above assumes the workers are active, but it is often easier to make stupid workers which only indicate when they are done and start doing things when they are told to. In that case you only need point-to-point communication between the master and the slaves.
In any case the workers have to poll the communication regularly when they are doing their work, and from the performance point of view the polling interval is important. If you poll very frequently, you lose time polling. If your poll interval is very long, then you lose time when something happens. The situation is easier with the master which can use blocking communication and just sit and wait until the workers say something.
So, you may use broadcasts between workers, or you may use master-slave communication, or you may use a combination of these. There are pros and cons in each approach, and the optimal solution depends on your application and requirements. (I usually pick the solution which is simplest to write and optimize only if there is a clear need.)
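For illustration, here is a minimal mpi4py sketch of one such scheme. It uses a batched-polling variant rather than explicit "I found it!" messages: every rank draws up to `batch` samples, then all ranks agree via an allreduce whether anyone has a hit, so the batch size plays the role of the polling interval discussed above. The function name, seeds, and batch size are my own illustrative choices, not part of the asker's program.

from mpi4py import MPI
import numpy as np

def parallel_fit(iterations, value, batch=1000):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    rng = np.random.RandomState(1234 + rank)  # a different seed per rank

    results = []
    for iteration in range(iterations):
        local_hit = -1.0                      # sentinel: nothing found yet
        while True:
            if local_hit < 0:
                for _ in range(batch):        # sample a batch, then sync
                    arbit = rng.uniform(0.0, 1.0)
                    if (10 - 0.000001 * iteration) * arbit > value:
                        local_hit = 10 * arbit
                        break
            # MAX over all ranks turns positive as soon as anyone has a hit;
            # every rank gets the same value, so they leave the loop together
            best = comm.allreduce(local_hit, op=MPI.MAX)
            if best > 0:
                results.append(best)
                break
    if rank == 0:
        print("Result: " + str(results))

if __name__ == '__main__':
    parallel_fit(20, 9.9999)  # run with e.g.: mpiexec -n 4 python script.py

The allreduce acts as the abort signal: ranks that have not found anything yet stop searching as soon as the reduction returns a positive value, at the cost of finishing their current batch first.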
I am only superficially familiar with MPI, but the answer to your question is "yes", it can be done with MPI.
Assume I have 4 lists: jobs, workers, mechanisms and mechanism equipment.
Currently I am first looping through jobs; let's call this the jobLoop. Inside jobLoop I'm running workerLoop, checking whether each worker is available and has the required competences to do the job.
If the worker is OK, I loop through mechanismLoop, checking if the worker can use the mechanism and if the mechanism is available. If no mechanisms are available, I fall back to workerLoop, looking for another suitable worker.
If the mechanism is OK, I loop through mechEquipmentLoop, checking if the worker can use the equipment and if the equipment is available. If no equipment is available, I fall back to mechanismLoop, looking for another suitable mechanism.
If mechanism equipment is finally okay, the algorithm is done. If not, the algorithm says items cannot be matched.
This is a simplified version; at each step there are many more checks, such as whether the worker is allowed on the site where the job is done, and so on.
I'm trying to think of a more efficient way to do this. Currently the time complexity for this algorithm should be roughly O(n^4), right? I'm not looking for code, just guidance on how to perform this.
IMHO, this algorithm is O(j*w*m*e) rather than O(n^4), where j = number of jobs, w = number of workers, m = number of mechanisms, and e = number of pieces of mechanism equipment.
If these lists don't change and the answer is needed only once, this is the best algorithm you can execute: you need to visit all the inputs at least once.
Suppose the lists change and the same algorithm needs to execute multiple times for a given job. Then you can do the following.
Store the workers in a BST (or a self-balancing tree like AVL) with job competency as the key. If a worker has multiple competencies, the worker appears under each competency. Creating the tree is O(w log w), where w is the number of unique competency-worker combinations, not the number of workers alone. Deletion, addition, and search are each O(log w). Here we are assuming that the competency-worker distribution is reasonably even; if, say, all workers share a single competency, this degenerates to O(w) again.
The same applies to mechanisms and equipment. This makes the search at each level O(log m) and O(log e).
So for every job, the best-case allocation is O(log w * log m * log e), with a one-time overhead of O(w log w + m log m + e log e). For all jobs it is O(j * log w * log m * log e).
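To make the indexing idea concrete, here is a sketch in Python using a sorted list with bisect standing in for the BST/AVL tree (the names add_worker and workers_with are illustrative; a real balanced tree would also make insertion O(log w), where the sorted list below pays O(w) per insert):

import bisect

worker_index = []  # sorted list of (competency, worker_id) pairs

def add_worker(worker_id, competencies):
    # one entry per (competency, worker) combination, as described above
    for c in competencies:
        bisect.insort(worker_index, (c, worker_id))

def workers_with(competency):
    # binary search for the first entry with this competency: O(log w)
    i = bisect.bisect_left(worker_index, (competency,))
    out = []
    while i < len(worker_index) and worker_index[i][0] == competency:
        out.append(worker_index[i][1])
        i += 1
    return out

add_worker("alice", ["crane", "welding"])
add_worker("bob", ["welding"])
print(workers_with("welding"))  # ['alice', 'bob']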
I'm developing my first simulation in OMNeT++ 5.4, making use of the queueinglib library. In particular, I've built a simple model comprising a server and some passive queues.
I've repeated the simulation several times, setting different seeds and parameters as described in the OMNeT++ Simulation Manual. Here is my omnetpp.ini:
# specify how many times a run needs to be repeated
repeat = 100
# default: different seed for each run
# seed-set = ${runnumber}
seed-set = ${repetition} # I tried both lines
OppNet.source.interArrivalTime = exponential(${10,5,2.5}s)
This produces 300 runs, 100 repetitions for each value of the interArrivalTime distribution parameter.
However, I'm observing some "strange" behaviour: the resulting statistics are highly variable depending on the RNG seed.
For example, considering the lengths of the queues in my model, in the majority of the runs I get mean values smaller than 10, while in a few others the mean differs by orders of magnitude (85000, 45000?).
Does this mean that my implementation is wrong? Or is it possible that the random seed selection could influence the simulation outcomes so heavily?
Any help or hint is appreciated, thank you.
I cannot rule out that your implementation is wrong without seeing it, but it is entirely possible that you just happened to configure a chaotic scenario.
And in that case, yes, any minor change in the inputs (in this case the PRNG seed) might cause a major divergence in the results.
EDIT:
Especially considering a given (non-trivial) queuing network, if you are varying the "load" (number/rate of incoming jobs/messages, with some random distribution), you might observe different "regimes" in the results:
with very light load, there is barely any queuing
with a really heavy load (close, or at max capacity), most queues are almost always loaded, or even growing constantly without limit
and somewhere in between the two, you might get this erratic behavior, where sometimes a couple of queues suddenly grow really deep, then are emptied, then different queues are loaded up, and so on, as a function of either time, the PRNG seed, or both; this is what I mean by chaotic
But this is just speculation...
Nobody can say whether your implementation is right or wrong without seeing it. However, there are some general rules that apply to queues which you should be aware of. You say that you're varying the interArrivalTime distribution parameter. A very important concept in queueing is the traffic intensity, which is the ratio of the arrival rate to the service rate. If that ratio is less than one, the line length can vary a great deal, but in the long run there will be time intervals when the queue empties out, because on average the server can handle more customers than arrive. This is a stable queue. However, if that ratio is greater than one, the queue will grow without bound: the longer you run the system, the longer the line will get. The thing that surprises many people is that the line also goes to infinity asymptotically when the traffic intensity is exactly one.
The other thing to know is that the closer the traffic intensity gets to one for a stable queue, the greater the variability is. That's because the average increases, but there will always be periods of line length zero as described above. In order to have zeros always present but have the average increasing, there must be times where the queue length gets above the average, which implies the variability must be increasing. Changing the random number seed gives you some visibility on the magnitude of the variance that's possible at any slice of time.
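As a hypothetical illustration (the service time below is an assumption, not something taken from the question): with the interarrival times swept in the omnetpp.ini above, the traffic intensity is the mean service time divided by the mean interarrival time:

# rho = mean service time / mean interarrival time (assumed service time!)
service_time = 3.0                      # hypothetical value, in seconds
for inter_arrival in (10.0, 5.0, 2.5):  # the values swept in omnetpp.ini
    rho = service_time / inter_arrival
    print(inter_arrival, rho, "stable" if rho < 1 else "unstable")
# 10.0 -> rho = 0.3, stable; 5.0 -> 0.6, stable; 2.5 -> 1.2, unstable

If the 2.5 s runs are the ones blowing up to mean queue lengths in the tens of thousands, crossing this kind of stability threshold would explain it.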
Bottom line here is that you may just be seeing evidence that queues are weirder and more variable than you thought.
I have been working on prime sieve algorithm, and the basic implementation is working fine for me. What I am currently struggling with is a way to divide and distribute the calculation on to multiple processors.
I know it would require storing the actual sieve in a shared memory area or a text file, but how would one go about dividing the calculation-related steps?
Any lead would help. Thanks!
Split the numbers into sections of equal size; each processor will be responsible for one of these sections.
Another processor (or one of the processors) will generate the numbers whose multiples need to be crossed off, and pass each such number to the other processors.
Each processor then uses the given number, the section size, and its own section index to determine the offset of the first applicable multiple within its own section, and loops through, crossing off the applicable numbers (see the sketch below).
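Here is a minimal sketch of that scheme in Python, using multiprocessing workers in place of separate processors (the names and the section layout are illustrative; a base sieve up to sqrt(limit) plays the role of the generator processor):

from multiprocessing import Pool
import math

def sieve_small(n):
    """Ordinary sieve of Eratosthenes up to n (the 'generator' role)."""
    flags = [True] * (n + 1)
    flags[0] = flags[1] = False
    for p in range(2, math.isqrt(n) + 1):
        if flags[p]:
            for m in range(p * p, n + 1, p):
                flags[m] = False
    return [p for p in range(2, n + 1) if flags[p]]

def sieve_section(args):
    lo, hi, base_primes = args
    flags = [True] * (hi - lo)
    for p in base_primes:
        # first multiple of p inside [lo, hi): the 'offset' computation
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = False
    return [lo + i for i, f in enumerate(flags) if f]

def parallel_sieve(limit, workers=4):
    base = sieve_small(math.isqrt(limit))
    step = (limit - 2) // workers + 1
    jobs = [(2 + k * step, min(2 + (k + 1) * step, limit), base)
            for k in range(workers)]
    with Pool(workers) as pool:
        return [p for chunk in pool.map(sieve_section, jobs) for p in chunk]

if __name__ == '__main__':
    print(parallel_sieve(100))  # primes below 100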
Alternatively, one could take a much simpler approach by just using shared memory.
Let the first processor start crossing off multiples of 2, the second multiples of 3, the third multiples of 5, etc.
Essentially just let each processor grab the next number from the array and run with it.
If you don't do this carefully, you may end up with the third processor crossing off multiples of 4 (the first processor hadn't reached 4 yet when the third started, so 4 wasn't crossed off). This shouldn't add too much work, though: it takes increasingly longer for a multiple of some prime to be grabbed by a processor, while that prime is always the first value crossed off by the processor handling it, so the likelihood of this redundancy decreases very quickly.
Using shared memory like this tends to be risky. If you plan on using one bit per index, most languages don't let you work at that level, so you'll end up doing bitwise operations (probably bitwise AND) on a few bytes to make your changes (although this complexity might be hidden in some API). Many languages also don't make this operation atomic: one thread can read a value, AND it, and write it back, while another thread reads the value before the first thread's write, ANDs it, and writes it back after the first thread's write, causing the first thread's change to be lost. There's no simple, efficient fix for this; what exactly you need to do will depend on the language.
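To make the lost-update hazard concrete, here is a sketch in Python of the read-modify-write involved (the helper names are illustrative; CPython's GIL hides much of this in practice, but the pattern is what matters in lower-level languages):

import threading

sieve_lock = threading.Lock()

def clear_bit(sieve, k):
    """Mark index k composite in a bit-packed sieve (a bytearray)."""
    byte, bit = divmod(k, 8)
    # Read-modify-write: without a lock, two threads updating the same
    # byte can interleave here and one of the updates gets lost.
    sieve[byte] &= ~(1 << bit) & 0xFF

def clear_bit_safely(sieve, k):
    with sieve_lock:  # serializes the read-modify-write
        clear_bit(sieve, k)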
I am new to randomized algorithms, and am learning the subject on my own by reading books. I am currently reading Data Structures and Algorithm Analysis by Mark Allen Weiss.
Suppose we only need to flip a coin; thus, we must generate a 0 or 1 randomly. One way to do this is to examine the system clock. The clock might record time as an integer that counts the number of seconds since January 1, 1970 (at least on UNIX systems). We could then use the lowest bit. The problem is that this does not work well if a sequence of random numbers is needed. One second is a long time, and the clock might not change at all while the program is running. Even if the time were recorded in units of microseconds, if the program were running by itself the sequence of numbers that would be generated would be far from random, since the time between calls to the generator would be essentially identical on every program invocation. We see, then, that what is really needed is a sequence of random numbers. These numbers should appear independent. If a coin is flipped and heads appears, the next coin flip should still be equally likely to come up heads or tails.
Following are my questions on the above text snippet.
The author says that using the lowest bit of a seconds counter does not work well because "one second is a long time, and the clock might not change at all." Why is one second a long time when the clock changes every second, and in what context does the author mean that the clock does not change? Please help me understand with a simple example.
Also, why does the author say that even with microsecond resolution we do not get a sequence of random numbers?
Thanks!
Programs using random (or in this case pseudo-random) numbers usually need plenty of them in a short time. That's one reason why simply using the clock doesn't really work: the system clock doesn't update as fast as your code requests new numbers, so you're quite likely to get the same result over and over again until the clock changes. It's probably more noticeable on Unix systems, where the usual method of getting the time only gives you second accuracy. And not even microseconds really help, as computers are way faster than that by now.
The second problem you want to avoid is linear dependency of pseudo-random values. Imagine you want to place a number of dots in a square, randomly. You'll pick an x and a y coordinate. If your pseudo-random values are a simple linear sequence (like what you'd obtain naïvely from a clock) you'd get a diagonal line with many points clumped together in the same place. That doesn't really work.
One of the simplest types of pseudo-random number generators, the Linear Congruential Generator, has a similar problem, even though it's not so readily apparent at first sight. Due to the very simple formula

X_{n+1} = (a * X_n + c) mod m

you'll still get quite predictable results, most visibly if you plot consecutive values as points in 3D space: all the numbers lie on a small number of distinct planes (a problem all pseudo-random generators exhibit beyond a certain dimension).
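For example (a classic illustration, not taken from the book): the infamous RANDU generator, an LCG with a = 65539, c = 0, m = 2^31, produces consecutive triples that all fall on just 15 planes in the unit cube:

def randu(seed, n):
    """The historical RANDU LCG: x_{n+1} = 65539 * x_n mod 2^31."""
    x, out = seed, []
    for _ in range(n):
        x = (65539 * x) % 2 ** 31
        out.append(x / 2 ** 31)  # normalize to [0, 1)
    return out

vals = randu(1, 3000)
triples = list(zip(vals[0::3], vals[1::3], vals[2::3]))
# A 3D scatter plot of `triples` (e.g. with matplotlib) shows all the
# points collapsing onto a small number of parallel planes.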
Computers are fast. I'm oversimplifying, but if your clock speed is measured in GHz, a computer can do billions of operations in 1 second. Relatively speaking, 1 second is an eternity, so it is entirely possible that the clock does not change during a run.
If your program is doing regular operation, it is not guaranteed to sample the clock at a random time. Therefore, you don't get a random number.
Don't forget that for a computer, a single second can be 'an eternity'. Programs / algorithms are often executed in a matter of milliseconds (thousandths of a second).
The following pseudocode:

for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)

fills n a thousand times with a random number between 0 and 1000. On a typical machine, this script executes almost immediately.
You typically only initialize the seed once, at the beginning. The following pseudocode:

srand(time());
for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)

initializes the seed once and then executes the loop, generating a seemingly random set of numbers. The problem arises when you execute the code multiple times. Let's say the code runs in 3 milliseconds, and then runs again 3 milliseconds later, both within the same second. The seed is then the same, and the result is the same set of numbers.
For the second point: the author probably assumes a fast computer. The above problem still holds even at microsecond resolution.
What he means is that you cannot control how fast your computer, or any other computer, runs your code, so assuming an execution time of around 1 second is far from reality. If you run the code yourself you will see that it executes in milliseconds, so even microsecond clock resolution is not enough to ensure you get random numbers!
Let's pretend I have two buildings in which I can build different units.
A building can only build one unit at a time, but has a FIFO queue of at most 5 units, which will be built in sequence.
Every unit has a build time.
I need to know the fastest way to get my units built, considering the units already in the build queues of my buildings.
"Famous" algorithms like round-robin don't work here, I think.
Are there any algorithms, which can solve this problem?
This reminds me a bit of StarCraft :D
I would just add an integer to each building which represents the time it is busy.
Of course you have to update this variable once per time unit. (Time units are "s" here, for seconds.)
So let's say we have a building to which we submit 3 units, each taking 5s to complete, which sums to 15s total. We are at time = 0.
Then we have another building to which we submit 2 units that each need 6s to complete, 12s total.
So we can have a table like this:
Time 0
Building 1: 3 units, 15s to complete.
Building 2: 2 units, 12s to complete.
Time 1
Building 1: 3 units, 14s to complete.
Building 2: 2 units, 11s to complete.
Now suppose at time 1 we want to add another unit that takes 2s. We can simply loop through the buildings and pick the one with the lowest time to complete: in this case building 2, whose remaining time becomes 11s + 2s = 13s. This leads to time 2:
Time 2
Building 1: 3 units, 13s to complete.
Building 2: 3 units, 12s to complete.
...
Time 5
Building 1: 2 units, 10s to complete (5s have passed, so the first unit pops out).
Building 2: 3 units, 9s to complete.
And so on.
Of course you have to take care of the upper bound on your production queues: if a building already has 5 units queued, don't assign it anything; pick the building with the next-lowest time to complete instead.
I don't know if you can implement this easily with your engine, or if it even supports some kind of time units.
This just requires updating all production facilities once per time unit, O(n) where n is the number of buildings that can produce something. Submitting a unit takes O(1) if you keep the buildings sorted by time to complete, lowest first: just a first-element lookup. In that case you have to re-sort the list after manipulating the queues, such as cancelling or adding units. A sketch of this bookkeeping follows below.
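Here is a minimal Python sketch of the idea (a deque of remaining build times per building stands in for the single busy-time integer; all names, and the capacity of 5 from the question, are illustrative):

from collections import deque

class Building:
    MAX_QUEUE = 5

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # remaining build times; front is in progress

    def busy_for(self):
        return sum(self.queue)  # seconds until this queue drains

    def tick(self):
        # called once per time unit
        if self.queue:
            self.queue[0] -= 1
            if self.queue[0] == 0:
                self.queue.popleft()  # the finished unit pops out

def submit(buildings, build_time):
    # choose the building that will be free soonest and still has room
    candidates = [b for b in buildings if len(b.queue) < Building.MAX_QUEUE]
    if not candidates:
        return None  # everything is full
    b = min(candidates, key=Building.busy_for)
    b.queue.append(build_time)
    return b.name

b1, b2 = Building("B1"), Building("B2")
b1.queue.extend([5, 5, 5])  # 3 units, 15s total
b2.queue.extend([6, 6])     # 2 units, 12s total
print(submit([b1, b2], 2))  # -> 'B2', since 12s < 15s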
Otherwise amit's answer seems possible, too.
This is an NP-complete problem (proof at the end of the answer), so your best hope of finding the ideal solution is to try all possibilities (2^n of them for two queues, where n is the number of tasks).
A possible heuristic was suggested in a comment (and improved in the comments by AShelly): sort the tasks from biggest to smallest and put them in one shared queue; each building takes the next task from the queue whenever it finishes its current one.
This is of course not always optimal, but I think it will give good results for most cases.
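A minimal sketch of that heuristic in Python (essentially the "Longest Processing Time" rule mentioned in the other answer), assuming all buildings start empty; the function name is illustrative:

import heapq

def lpt_schedule(build_times, n_queues=2):
    # Sort tasks from biggest to smallest; always give the next task
    # to the queue that currently finishes earliest.
    queues = [(0, i, []) for i in range(n_queues)]  # (load, id, tasks)
    heapq.heapify(queues)
    for t in sorted(build_times, reverse=True):
        load, i, tasks = heapq.heappop(queues)
        tasks.append(t)
        heapq.heappush(queues, (load + t, i, tasks))
    return sorted(queues)

for load, _, tasks in lpt_schedule([5, 4, 3, 3, 3, 2]):
    print(load, tasks)  # both queues finish at t = 10 in this example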
Proof that the problem is NP-complete:
Let S = {u | u is a unit that needs to be produced} (S is the set containing all 'tasks').
Claim: if a perfect split is possible (both queues finish at the same time), it is optimal. Call this finishing time HalfTime.
This is true because in any other solution at least one of the queues has to finish at t > HalfTime, and thus that solution is not better.
Proof:
Assume we had an algorithm A that produces the best solution in polynomial time. Then we could solve the partition problem in polynomial time with the following algorithm:
1. Run A on the input (the numbers to be partitioned, taken as build times for 2 queues).
2. If the 2 queues finish exactly at HalfTime, return True.
3. Else, return False.
This solves the partition problem because of the claim: if a perfect partition exists, A will find it, since it is optimal. Steps 1, 2 and 3 all run in polynomial time (1 by assumption; 2 and 3 are trivial). So the suggested algorithm solves the partition problem in polynomial time, and thus our problem is NPC.
Q.E.D.
Here's a simple scheme:
Let U be the list of units you want to build, and F be the set of factories that can build them. For each factory, track the total time-til-complete, i.e. how long until its queue is completely empty.
Sort U by decreasing time-to-build, and maintain the sort order when inserting new items.
At the start, or at the end of any time tick in which a factory completes a unit or runs out of work:
Make a ready list of all the factories with space in the queue
Sort the ready list by increasing time-til-complete
Get the factory that will be done soonest
Take the first item from U, add it to that factory
Repeat until U is empty or all queues are full.
Googling "minimum makespan" may give you some leads into other solutions. This CMU lecture has a nice overview.
It turns out that if you know the set of work ahead of time, this problem is exactly Multiprocessor_scheduling, which is NP-Complete. Apparently the algorithm I suggested is called "Longest Processing Time", and it will always give a result no longer than 4/3 of the optimal time.
If you don't know the jobs ahead of time, it is a case of online Job-Shop Scheduling
The paper "The Power of Reordering for Online Minimum Makespan Scheduling" says
for many problems, including minimum
makespan scheduling, it is reasonable
to not only provide a lookahead to a
certain number of future jobs, but
additionally to allow the algorithm to
choose one of these jobs for
processing next and, therefore, to
reorder the input sequence.
Because you have a FIFO on each of your factories, you essentially do have the ability to buffer the incoming jobs: you can hold them until a factory is completely idle, instead of trying to keep all the FIFOs full at all times.
If I understand the paper correctly, the upshot of the scheme is to:
Keep a fixed-size buffer of incoming jobs. In general, the bigger the buffer, the closer to ideal scheduling you get.
Assign a weight w to each factory according to a given formula, which depends on the buffer size. In the case where buffer size = number of factories + 1, use weights of (2/3, 1/3) for 2 factories and (5/11, 4/11, 2/11) for 3.
Once the buffer is full, whenever a new job arrives, remove the job with the least time to build and assign it to a factory with a time-to-complete < w*T, where T is the total time-to-complete of all factories.
If there are no more incoming jobs, schedule the remainder of the jobs in U using the first algorithm I gave.
The main problem in applying this to your situation is that you don't know when (if ever) there will be no more incoming jobs. But perhaps just replacing that condition with "if any factory is completely idle", and then restarting, will give decent results.