MPI: shared variable value for all processors - parallel-processing

Here's one question about MPI. I need two processors that keeps modifying one variable and I want both processors to have access to the variable with the most up-to-date value.
from mpi4py import MPI
from time import sleep
comm = MPI.COMM_WORLD
rank = comm.rank
assert comm.size == 2
msg = 0
sec = 10
if comm.rank == 0:
for i in range(sec):
print msg
sleep(1)
msg = comm.bcast(msg,root = 1)
else:
for i in range(sec*2):
msg += 1
sleep(0.5)
comm.bcast(msg,root = 1)
So I'm expecting the program to print something like: 0 2 4 ...
But the program turns out to print: 0 1 2 3 4 5 6 7 8 9
I'm curious if there's a mechanism in mpi4py such that the variable msg is shared by both processors? That is, whenever msg is modified by processor 1, the new value becomes immediately available to processor 0. In other words, I want processor 0 to access the most-up-to-date value of msg instead of waiting for every changes that were made on msg by processor 1.

I think you're getting confused about how distributed memory programming works. In MPI, each process (or rank) has its own memory, and therefore when it changes values via load/store operations (like what you're doing with msg += 1), it will not affect the value of the variable on another process. The only way to update remote values is by sending messages, which you are doing with the comm.bcast() call. This sends the local value of msg from rank 1 to all other ranks. Until this point, there's no way for rank 0 to know what's been happening on rank 1.
If you want to have shared values between processes, then you probably need to take a look at something else, perhaps threads. You'll lose the distributed abilities of MPI if you switch to OpenMP, but that might not be what you needed MPI for in the first place. There are ways of doing this with distributed memory models (such as PGAS languages like Unified Parallel C, Global Arrays, etc.), but you will always run into the issue of latency which means that there will be some time that the values on ranks 0 and 1 are not synchronized unless you have some sort of protection to enforce it.

As mentioned by Wesley Bland, this isn't really possible in a pure distributed memory environment, as memory isn't shared.
However, MPI has for some time (since 1997) allowed something like this in the MPI-2, as one-sided communications; these have been updated significantly in MPI-3 (2012). This approach can have real advantages, but one has to be a little careful; since memory isn't really shared, every update requires expensive communications and it's easy to accidentally put significant scalability/performance bottlenecks in your code by over-reliance on shared state.
The Using MPI-2 book has an example of implementing a counter using the MPI-2 one-sided communications; a simple version of that counter is described and implemented in this answer in C. In the mpi4py distribution, under 'demos', there are implementations of these same counters in the 'nxtval' demo; the same simple counter as nxtval-onesided.py and a more complicated but more scalable implementation, also as described in the Using MPI-2 book, as nxtval-scalable.py. You should be able to use either of those implementations more or less as-is in the above code.

Related

Data races with MESI optimization

I dont really understand what exactly is causing the problem in this example:
Here is a snippet from my book:
Based on the discussion of the MESI protocol in the preceding section, it would
seem that the problem of data sharing between L1 caches in a multicore machine
has been solved in a watertight way. How, then, can the memory ordering
bugs we’ve hinted at actually happen?
There’s a one-word answer to that question: Optimization. On most hardware,
the MESI protocol is highly optimized to minimize latency. This means
that some operations aren’t actually performed immediately when messages
are received over the ICB. Instead, they are deferred to save time. As with
compiler optimizations and CPU out-of-order execution optimizations, MESI
optimizations are carefully crafted so as to be undetectable by a single thread.
But, as you might expect, concurrent programs once again get the raw end of
this deal.
For example, our producer (running on Core 1) writes 42 into g_data and
then immediately writes 1 into g_ready. Under certain circumstances, optimizations
in the MESI protocol can cause the new value of g_ready to become
visible to other cores within the cache coherency domain before the updated
value of g_data becomes visible. This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet. This means that the consumer (on Core 2) can potentially
see a value of 1 for g_ready before it sees a value of 42 in g_data, resulting in
a data race bug.
Here is the code:
int32_t g_data = 0;
int32_t g_ready = 0;
void ProducerThread() // running on Core 1
{
g_data = 42;
// assume no instruction reordering across this line
g_ready = 1;
}
void ConsumerThread() // running on Core 2
{
while (!g_ready)
PAUSE();
// assume no instruction reordering across this line
ASSERT(g_data == 42);
}
How can g_data be computed but not present in the cache?
This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet.
If g_data is not in cache, then why does the previous sentece end with a yet? Would the CPU load the cache line with g_data after it has been computed?
If we read this sentence:
This means that some operations aren’t actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time.
Then what operation is deferred in our example with producer and consumer threads?
So basically I dont understand how under the MESI protocol, some operations are visible to other cores in the wrong order, despite being computed in the right order by a specific core.
PS:
This example is from a book called "Game Engine Architecture, Third Edition" by Jason Gregory, its on the page 309. Here is the book

Python Multiprocessing slight slower than Multithreading on Windows

I have been experimenting my code to send "parallel" commands to multiple serial COM ports.
My multi-threading code consists of:
global q
q = Queue()
devices = [0, 1, 2, 3]
for i in devices:
q.put(i)
cpus=cpu_count() #detect number of cores
logging.debug("Creating %d threads" % cpus)
for i in range(cpus):
t = Thread(name= 'DeviceThread_'+str(i), target=testFunc1)
t.daemon = True
t.start()
and multi-processing code consists of:
devices = [0, 1, 2, 3]
cpus=cpu_count() #detect number of cores
pool = Pool(cpus)
results = pool.map(multi_run_wrapper, devices)
I observe that the task of sending serial commands to 4 COM ports in "parallel" takes about 6 seconds and multi-processing always always takes a 0.5 to 1 second of additional total run time.
Any inputs on why the discrepancy on a Windows machine?
Well, for one, you're not comparing apples to apples. If you want equivalent code, use multiprocessing.dummy.Pool in your threaded case (which is the same as multiprocessing.Pool implemented in terms of threads, not processes), so you're at least using the same basic parallelization model with different internal implementations, not changing everything all at once.
Beyond that, launching the workers and communicating data to them has some overhead, more on Windows than on other systems since Windows can't fork to spawn new processes cheaply; it has to spawn a new Python instance and then copy over state via IPC to approximate forking.
Aside from that, you haven't provided enough information; your process and thread based worker functions aren't provided, and could cause significant differences in behavior. Nor have you provided information on how you're performing timing. Similarly, if each worker process needs to reinitialize the COM port communication library, that could involve non-trivial overhead.

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data sets and the operations on each of the data elements are independent, OpenCL can be one of the good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here the function1 is applied on each of the elements on the BigData and produce another two big data sets (X,Y). function2 and function3 are then applied operation individually on each of the elements on these X,Y data, respectively.
Since the operations of all the functions are applied on each of the elements of the data sets independently, using GPU might make it faster. So I come up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. each of the thread will work on one of the data
element and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function2(M);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data element of M and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run on a GPU). And if the kernels have some sort of synchronizations it might be ended up with more slowdown.
I also believe the workflow is kind of common. So what is the best practice to using OpenCL for speedup for a program like this.
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind Of job size, (Tens? Thousands? Millions of arithmetic ops? How big are your data sets?) or what hardware. (Compute card? Laptop IGPU?) "Significant overhead" can mean a lot of different things. 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?

Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Using vectors depends on kernel itself. If you need all four values at same time (for example: at start of kernel, at start of loop), it's better to pack them, because they will be assigned during one read (Values in single vector are stored sequential).
On the other hand, when you need only some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, each reading divided by some computation (i.e. give GPU some time to fetch data).
Basically, this data read segments, behaves like buffer. If you have enough instances, number of reads is same (in optional cause) and what really counts is how well are these reads used.
Compiler often unpack these structures so only speedup is, that you have all variables nicely stored, so when you read, you fill them all up with one read and rest of buffer is used for another instance.
As example, I will use 128 bits wide bus and 4 floats (32 bits).
(32b * 4) / 128b = 1 instance/read
For scalar data types, there are N reads (N = number of variables), each read filling one variable in each instance up to the number of fetched variables.
32b / 128b = 4 instance/read
So in my examples, if you have 4 instances, there will always be at least 4 reads no matter what and only thing, you can do with this is cover fetching time by some computation, if it's even possible.

Go-lang parallel segment runs slower than series segment

I have built an epidemic mathematics model which is fairly computationally intense in Go. I'm trying now to build a set of systems to test my model, where I change an input and expect a different output. I built a version in series to slowly increase HIV prevalence and see effects on HIV deaths. It takes ~200 milliseconds to run.
for q = 0.0; q < 1000; q++ {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small changes are important as we will be running millions of runs with different inputs, so would like to make it as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
results := costAnalysisHandler(inputs)
fmt.Println(results.HivDeaths[20])
ch <- ChData{int(q), results.HivDeaths[20]}
}(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
theResults := <-ch
fmt.Println(theResults)
}
Any thoughts are very much appreciated.
There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs equals the cost of communication if the program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time--e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as #Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.
Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPU's you have before running your code. Or call runtime.GOMAXRPROCS(runtime.NumCPU()) before you start the parallell computations.
I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one cpu/core. Typically one would set it for the number of processors in the machine but the ideal setting can vary.
The second problem is a bit trickier, which is that your code doesn't appear to be parallelizing very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations(a good starting number would be to set it the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU

Resources