I would like to know which is the best way to ensure an exclusive access to a shared resource (such as memory window) among n processes in MPI. I've tried MPI_Win_lock & MPI_Win_fence but they don't seem to work as expected, i.e: I can see that multiple processes enter a critical region (code between MPI_Win_lock & MPI_Win_unlock that contains MPI_Get and/or MPI_Put) at the same time.
I would appreciate your suggestions. Thanks.
In MPI 2 you cannot truly do atomic operations. This is introduced in MPI 3 using MPI_Fetch_and_op. This is why your critical data is modified.
Furthermore, take care with `MPI_Win_lock'. As described here:
The name of this routine is misleading. In particular, this routine need not block, except when the target process is the calling process.
The actual blocking process is MPI_Win_unlock, meaning that only after returning from this procedure you can be sure that the values from put and get are correct. Perhaps this is better described here:
MPI passive target operations are organized into access epochs that are bracketed by MPI Win lock and MPI Win unlock calls. Clever MPI implementations [10] will combine all the data movement operations (puts, gets, and accumulates) into one network transaction that occurs at the unlock.
This same document can also provide a solution to your problem, which is that critical data is not written atomically. It does this through the use of a mutex, which is a mechanism that ensures only one process can access data at the time.
I recommend you read this document: The solution they propose is not difficult to implement.
In the parallel MPI program on for example 100 processors:
In case of having a global counting number which should be known by all MPI processes and each one of them can add to this number and the others should see the change instantly and add to the changed value.
Synchronization is not possible and would have lots of latency issue.
Would it be OK to open a shared memory among all the processes and use this memory for accessing this number also changing that?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I write you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you ? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Given that you have the hardware to support 100 processes in the same, shared, memory, you still have to write code to ensure synchronisation, MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or any implementations that I am familiar with, which will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI providing shared-memory-like parallel programs) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter but a better approach would be to rethink your program's structure to avoid too-tight synchronisation between processes.
I have a simple multi-threaded app for my multi-core system. This app has a parallel region in which no threads write to a given memory address, but some may read simultaneously.
Will there still be some type of overhead or performance hit associated with several threads accessing the same memory even if though no locking is used? If so, why? How big an impact can it have and what can be done about it?
This can depend on the specific cache synchronization protocol in use, but most modern CPUs support having the same cache line shared in multiple processor caches, provided there is no write activity to the cache line. That said, make sure you align your allocations to the cache line size; if you don't, it's possible that data that's being written to could share the same cache line as your read-only data, resulting in a performance hit when the dirtied cache line is flushed on other processors (false sharing).
I would say there wouldn't be. However the problem arises when you have multiple writers to the same references.
Is it possible to share memory between two MFC C++ applications without using Memory mapped files? Currently we are using this method to share structs, and it is too slow for our requirements. Is there a better way?
Are you sure it is the memory mapped files that are slow? The OS maps the same piece of RAM into both process spaces (when it's paged in.) Other culprits to performance issues could be mutexes and other synchronization primitives/volatile reads and cache invalidation to propagate concurrent changes to memory between processes.
You might try making changes locally to a non-shared region, and then bulk copying that, rather than repeatedly writing to the shared memory.
Other alternatives are message passing, RPC or DCOM, but I doubt these will be more performant, especially if the amount of data being transferred/referenced is large.
I would have thought that once you'd established the memory mapping (with the MapViewOfFile), that would be pretty fast.
Is your performance problem with actually setting up the mapped memory, rather than using it once it's set up?
If you do genuinely have a verified problem with the memory mapped files, this is another technique: http://msdn.microsoft.com/en-us/library/h90dkhs0%28VS.80%29.aspx (DLL shared memory segments), but I doubt it's really going to help you.
the other day a colleague of mine stated that using static classes can cause performance issues on multi-core systems, because the static instance cannot be shared between the processor caches. Is that right? Are there some benchmarks around proofing this statement? This statement was made in the context of .Net development (with C#) related discussion, but it sounds to me like a language and environment independent problem.
Thx for your comments.
I would push your colleague for data or at least references.
The thing is, if you've got shared data, you've got shared data. Whether that's exposed through static classes, a singleton, whatever, isn't terribly important. If you don't need the shared data in the first place, I expect you wouldn't have a static class anyway.
Besides all of this, in any given application there's likely to be a much bigger bottleneck than processor caches for shared data in static classes.
As ever, write the most sensible, readable, maintainable code first - then work out if you have a performance bottleneck and act accordingly.
"[a] static instance cannot be shared between the processor caches. Is that right?"
That statement doesn't make much sense to me. The point of each processor's dedicated cache is that it contains a private copy of a small patch of memory, so that if the processor is doing some algorithm that only needs to access that particular memory region then it doesn't have to go to keep going back to access the external memory. If we're talking about the static fields inside a static class, the memory for those fields may all fit into a contiguous chunk of memory that will in turn fit into a single processor's (or core's) dedicated cache. But they each have their own cached copy - it's not "shared". That's the point of caches.
If an algorithm's working set is bigger than a cache then it will defeat that cache. Meaning that as the algorithm runs, it repeatedly causes the processor to pull data from external memory, because all the necessary pieces won't fit in the cache at once. But this is a general problem that doesn't apply specifically to static classes.
I wonder if your colleague was actually talking not about performance but about the need to apply correct locking if multiple threads are reading/writing the same data?
If multiple threads are writing to that data, you'll have cache thrashing (the write on one CPU's cache invalidates the caches of the other CPUs). Your friend is technically correct, but there's a good chance it's not your primary bottleneck, so it doesn't matter.
If multiple threads are reading the data, your friend is flat-out wrong.
If you don't use any kind of locks or synchronization then static-vs.-non-static won't have any influence on your performance.
If you're using synchronization then you could run into a problem if all threads need to acquire the same lock, but that's only a side-effect of the static-ness and not a direct result of the methods being static.
In any "virtual machine" controlled language (.NET, Java, etc) this control is likely delegated to the underlying OS and likely further down to the BIOS and other scheduling controls. That being said, in the two biggies, .NET and Java, static vs. non-static is a memory issue, not a CPU issue.
Re-iterating saua's point, the impact on the CPU comes from the synchronization and thread control, not the access to the static information.
The problem with CPU cache management is not limited to only static methods. Only one CPU can update any memory address at a time. An object in your virtual machine, and specifically a field in your object, is a pointer to said memory address. Thus, even if I have a mutable object Foo, calling setBar(true) on Foo will only be allowed on a single CPU at a time.
All that being said, the point of .NET and Java is that you shouldn't be spending your time sweating these problems until you can prove that you have a problem and I doubt you will.
if you share mutable data between threads, you need either a lock or a lock-free algorithm (seldom available, and sometimes hard to use, unfortunately).
having few, widely used, lock-arbitrated resources can get you to bottlenecks.
static data is similar to a single-instance resource.
therefore:
if many threads access static data, and you use a lock to arbitrate, your threads are going to fight for access.
when designing a highly multithreaded app, try to use many fine-grained locks. split your data so that a thread can grab one piece and run with it, hopefully no other thread will need to wait for it because they're busy with their own pieces of data.
x86 architecture implements cache-snooping to keep data caches in sync on writes should they happen to cache the same thing... Not all architectures do that in hardware, some depend on software to make sure that the case never occurs.
Even if it were true, I suspect you have plenty of better ways to improve performance. When it gets down to changing static to instance, for processor caching, you'll know you are really pushing the envelope.