MPI shared memory access - parallel-processing

In a parallel MPI program running on, for example, 100 processors:
Suppose there is a global counter that should be known to all MPI processes. Each of them can add to this number, the others should see the change instantly, and any of them may then add to the changed value.
Synchronization is not an option, as it would introduce a lot of latency.
Would it be OK to open a shared-memory region among all the processes and use this memory both to read this number and to change it?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that, or is this not a good solution?

Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I write that you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle it, this problem is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Even given hardware that supports 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or in any implementation I am familiar with, that will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI programs providing shared-memory-like parallelism) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter, but a better approach would be to rethink your program's structure to avoid overly tight synchronisation between processes.
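If you do hand the task to the run-time, a minimal sketch of the idea, assuming MPI-3 is available, could look like the following: MPI_Fetch_and_op performs an atomic fetch-and-add on a counter exposed through an RMA window, so each increment is applied consistently without a hand-written critical section. The window layout and variable names here are illustrative, not a prescription.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 hosts the counter; the others attach a zero-size window. */
    int counter = 0;
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : NULL,
                   rank == 0 ? sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Atomically fetch the old value and add 1, all in one RMA call. */
    int one = 1, old_value;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Fetch_and_op(&one, &old_value, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    printf("rank %d saw %d before its increment\n", rank, old_value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Note that this makes the updates atomic and consistent, but it does not, and cannot, make them visible to all other processes "instantly"; each process still sees the counter only when it reads it.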

Related

MPI: Ensure an exclusive access to a shared memory (RMA)

I would like to know the best way to ensure exclusive access to a shared resource (such as a memory window) among n processes in MPI. I've tried MPI_Win_lock and MPI_Win_fence but they don't seem to work as expected, i.e. I can see multiple processes entering a critical region (the code between MPI_Win_lock and MPI_Win_unlock that contains MPI_Get and/or MPI_Put) at the same time.
I would appreciate your suggestions. Thanks.
In MPI-2 you cannot do truly atomic read-modify-write operations; these were introduced in MPI-3 with MPI_Fetch_and_op. This is why your critical data is being modified.
Furthermore, take care with MPI_Win_lock. As described here:
The name of this routine is misleading. In particular, this routine need not block, except when the target process is the calling process.
The actual blocking call is MPI_Win_unlock, meaning that only after returning from this procedure can you be sure that the values from the put and get are correct. Perhaps this is better described here:
MPI passive target operations are organized into access epochs that are bracketed by MPI_Win_lock and MPI_Win_unlock calls. Clever MPI implementations [10] will combine all the data movement operations (puts, gets, and accumulates) into one network transaction that occurs at the unlock.
This same document can also provide a solution to your problem of critical data not being written atomically. It does so through the use of a mutex, a mechanism that ensures only one process can access the data at a time.
I recommend you read that document: the solution it proposes is not difficult to implement.
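As a rough sketch of that mutex idea (this is not the document's exact code; it assumes MPI-3's MPI_Compare_and_swap and an MPI_INT window whose flag on lock_rank is initialised to 0):

#include <mpi.h>

/* Spin until we atomically change the flag on lock_rank from 0 to 1. */
void mutex_acquire(MPI_Win win, int lock_rank)
{
    int one = 1, zero = 0, result;
    do {
        MPI_Win_lock(MPI_LOCK_SHARED, lock_rank, 0, win);
        MPI_Compare_and_swap(&one, &zero, &result, MPI_INT,
                             lock_rank, 0, win);
        MPI_Win_unlock(lock_rank, win);
    } while (result != 0);   /* someone else held it; retry */
}

/* Atomically write 0 back to release the lock. */
void mutex_release(MPI_Win win, int lock_rank)
{
    int zero = 0, previous;
    MPI_Win_lock(MPI_LOCK_SHARED, lock_rank, 0, win);
    MPI_Fetch_and_op(&zero, &previous, MPI_INT, lock_rank, 0,
                     MPI_REPLACE, win);
    MPI_Win_unlock(lock_rank, win);
}

Everything between mutex_acquire and mutex_release then executes in one process at a time. The busy-wait keeps the sketch short; a real implementation would want to avoid hammering the target with retries.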

Are OpenCL workgroups executed simultaneously?

My understanding was, that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on the buffer directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order, so that I don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to those of a single work group, and enqueue a new NDRange for each group, essentially breaking up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that the kernel invocations run one after another.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
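A host-side sketch of that scheme, assuming a 1-D kernel and an in-order queue, and OpenCL 1.1 or later for the non-NULL global offset (the function and parameter names are illustrative):

#include <CL/cl.h>

/* Launch num_groups work-groups one at a time on an in-order queue:
   global size == local size, so each enqueue runs exactly one group,
   and the offset moves the window across the index space. */
void run_groups_serially(cl_command_queue queue, cl_kernel kernel,
                         size_t local_size, size_t num_groups)
{
    for (size_t g = 0; g < num_groups; ++g) {
        size_t offset = g * local_size;  /* where this group's items start */
        size_t global = local_size;      /* exactly one work-group */
        clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &global,
                               &local_size, 0, NULL, NULL);
    }
    clFinish(queue);  /* drain the serialized launches */
}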
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a Compute Unit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, the important ones being vector registers and workgroup-level memory** (called LDS on AMD and shared memory on Nvidia). If you want to launch just one workgroup per CU/SMX, make sure that the workgroup consumes the bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs; a GPU normally has several of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.

Is there a way to end idle threads in GNU OpenMP?

I use OpenMP for parallel sorting at start of my program. Once data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to turn off the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads which have already been created for a task.
Please look up OMP_WAIT_POLICY, which is new in OpenMP 4 [https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html].
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle support their own implementation of this feature already.
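A hedged illustration of how this is usually applied (the program structure below is illustrative; the key part is the environment):

/* Launch with, e.g.:
 *   OMP_WAIT_POLICY=passive ./daemon
 * (or GOMP_SPINCOUNT=0 for older libgomp) so that idle OpenMP
 * workers block in the kernel instead of busy-waiting. */
#include <omp.h>
#include <unistd.h>

int main(void)
{
    #pragma omp parallel
    {
        /* ... parallel sort of the input data at startup ... */
    }

    /* Daemon phase: the worker threads still exist, but with a
       passive wait policy they sleep and consume essentially no CPU. */
    for (;;)
        sleep(1);   /* ... serve requests ... */
}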
I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two-program solution (one parallel program to sort and one serial program for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and then reading it again. This may not be as slow as it sounds, since a modern Linux distribution might keep that file in memory in the file cache.
If you really want to be sure it stays in memory, you could launch the two processes simultaneously and allow them to share memory and allow the first parallel sort process to exit when it is done.
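A minimal sketch of that handoff using POSIX shared memory (the segment name and size are illustrative; link with -lrt on older glibc):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/sorted_data"   /* illustrative name */
#define SHM_SIZE (1 << 20)        /* illustrative size */

/* Sorter side: create the segment and fill it; it outlives the process. */
void *create_sorted_segment(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);
    return mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Daemon side: map the same segment read-only and serve from it. */
const void *open_sorted_segment(void)
{
    int fd = shm_open(SHM_NAME, O_RDONLY, 0);
    return mmap(NULL, SHM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
}

/* Call shm_unlink(SHM_NAME) once the daemon no longer needs the data. */

The parallel sorting process can exit as soon as it has written the segment; the data stays resident until it is unlinked.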
In theory, OpenMP has an implicit synchronization at the end of its "pragma" clauses. So, when the OpenMP parallel work ends, all the threads are deleted. You don't need to kill them or free them: OpenMP does that automatically.
Maybe omp_get_num_threads() is telling you the actual configuration of the program, not the number of active threads. I mean: if you set the number of threads to 4, OMP will tell you that the configuration is "4 threads", but this does not mean that there are actually 4 threads running.

Parallelism (Threads and Processes)

I have a question. I know the difference between a thread and a process in theory, but I still don't understand when we should use the former and when the latter. For example, we have a difficult task which needs to be parallelized. But in which way? Which is faster and MORE EFFECTIVE, and in what cases? Should we split our task into a few processes or into a few threads? Could you give a few examples? I know my question may seem silly, but I'm new to the topic of parallel computing. I hope you understand my question. Thank you in advance.
In general, there is only one main difference between processes and threads: All threads of a given process share the same virtual address space. Whereas each process has its own virtual address space.
When dealing with problems that require concurrent access to the same set of data, it is easier to use threads, because they can all directly access the same memory.
Threads share memory. Processes do not.
This means that processes are somewhat more expensive to start up. It also means that threads can conveniently communicate through shared memory, and processes cannot.
However, from a coding perspective, it also means that threads are significantly more difficult to program correctly. It's very easy for threads to stomp on each others' memory in unintended ways. Processes are somewhat safer.
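A small C illustration of the distinction (POSIX APIs, compile with -pthread; illustrative only): an increment made by a thread is visible to the whole process because threads share one address space, while the same increment made after fork() lands in the child's private copy.

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;                    /* one variable, two experiments */

void *thread_body(void *arg) { counter++; return NULL; }

int main(void)
{
    /* Thread: same address space, so the parent sees the increment. */
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread:  counter = %d\n", counter);   /* prints 1 */

    /* Process: fork() duplicates the address space, so it does not. */
    if (fork() == 0) { counter++; _exit(0); }
    wait(NULL);
    printf("after process: counter = %d\n", counter);   /* still 1 */
    return 0;
}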
Welcome to the world of concurrency!
There is no theoretical difference between threads and processes that is practical to generalize from. There are many, many different ways to implement threads, including ways that nearly mirror those of processes (e.g. Linux threads). Then there's lightweight threading, which involves the process managing the threading by itself; there is more variation still, because you can then have either a co-operative or a semi-preemptive threading model.
For example, consider Haskell's threading model and Python's.
Haskell offers lightweight threads that introduce little runtime overhead; there are well-defined points at which threads may yield control, but this is largely hidden from the user, giving the appearance of pre-emptive multitasking. Shared state is held in specially typed variables that are treated specially by the language. Because of this, multi-threaded, even concurrent, programs can be written in a largely single-threaded way, then forked from the main process. So there, threads are an abstraction mechanism, and may even be beneficial in a single-(OS)-threaded process to model the program; the model scales well to N threads, where N may be chosen dynamically. (And N Haskell threads are mapped dynamically to OS threads.)
Python allows threading, but with a huge bottleneck: the Global Interpreter Lock. Therefore, to gain serious performance benefits, one must use processes in practice. There is no feasible, performant threading model to speak of.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to use all the available cores and theoretically deliver better / superior performance (superior is the commercial term for better :P).
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything from multithreading at all but lose a lot to design complexity. What are the big red flags for multithreaded applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz: I want red warnings to look for in code. I know the application doesn't perform as well as it should; I need to know where to start looking, what should worry me, and where I should put my efforts. I know it's kind of a general question, but I can't post the entire program, and if I could choose one section of code then I wouldn't need to ask in the first place.
I'm using Delphi 7, although the application will be ported / remade in .NET (C#) next year, so I'd rather hear comments that are applicable as general practice, or, if they must be language-specific, that apply to either one of those languages.
One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: if you use a counter variable to count the number of items processed by all threads, this will really hurt performance, because the cache line holding the counter has to be synchronized between the cores whenever any CPU writes to the variable.
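A sketch of the usual remedy (C11 alignment; 64 bytes is a typical but illustrative line size): give each thread its own counter on its own cache line and combine them once at the end.

#include <stdalign.h>

#define NUM_THREADS 8
#define CACHE_LINE  64          /* typical x86 cache-line size */

/* Bad: every thread bounces this one line between cores. */
long shared_count;

/* Better: one counter per thread, each padded to its own line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};
struct padded_counter per_thread[NUM_THREADS];

/* Thread i increments per_thread[i].value; sum once when done. */
long total(void)
{
    long sum = 0;
    for (int i = 0; i < NUM_THREADS; ++i)
        sum += per_thread[i].value;
    return sum;
}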
One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
    bool value = askSomeSharedResourceForSomeValue();
    if (value)
        DoSomethingIfTrue();
    else
        DoSomethingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
    value = askSomeSharedResourceForSomeValue();
}
if (value)
    DoSomethingIfTrue();
else
    DoSomethingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates the point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces the surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads than there are cores typically means that the program is not performing optimally.
So a program which spawns loads of threads is usually not designed in the best fashion. A good example of this practice is the classic socket example where every incoming connection gets its own thread to handle the connection. It is a very non-scalable way to do things: the more threads there are, the more time the OS has to spend context switching between them.
You should first be familiar with Amdahl's law.
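For reference, Amdahl's law says that if a fraction p of the work can be parallelized over n cores, the best possible speedup is

S(n) = 1 / ((1 - p) + p / n)

With p = 0.9 and n = 8 the ceiling is already only about 4.7x, and no number of cores can push it past 1 / (1 - p) = 10x. The serial fraction, which includes time spent waiting on locks, dominates quickly.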
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUIs is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, a file that both use, a network connection that both use, or a processor that both use. You cannot avoid these dependencies on shared resources entirely, but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multithreaded application. Still, anything that makes a single-threaded application slow will also make a multithreaded application slow. It may be an idea to run your application as a single-threaded application and use a profiler to find out where its performance hotspots (bottlenecks) are.
When it's running as a multithreaded application, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads block on locks instead of busy-waiting, having 100% CPU across several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilization on a dual-processor machine is a sign that only one thread is running, so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware of multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate or delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but I have never used a lock-free strategy. Instead, my ambition is to use locks where necessary, so that the code is always threadsafe but will never deadlock, and to ensure that locks are held for a tiny amount of time (e.g. no more than the time it takes to push or pop a pointer on a thread-safe queue), so that the maximum time a thread can be blocked is insignificant compared to the time it spends doing useful work.
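To make that "tiny amount of time" concrete, a minimal C sketch (illustrative; a simple LIFO list stands in for the queue): the mutex is held only for the pointer manipulation, never while the item itself is produced or consumed.

#include <pthread.h>
#include <stddef.h>

struct node { void *item; struct node *next; };

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *q_head;

/* Hold the lock only long enough to splice one pointer. */
void queue_push(struct node *n)
{
    pthread_mutex_lock(&q_lock);
    n->next = q_head;
    q_head = n;
    pthread_mutex_unlock(&q_lock);
}

/* Returns NULL when empty; the caller works on the node unlocked. */
struct node *queue_pop(void)
{
    pthread_mutex_lock(&q_lock);
    struct node *n = q_head;
    if (n) q_head = n->next;
    pthread_mutex_unlock(&q_lock);
    return n;
}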
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are only reading a shared variable (as opposed to writing it). Reading is threadsafe as long as it does not take place simultaneously with a write, yet you still have to lock it down. The most naive form of locking treats reads and writes as the same type of operation, restricting access to the shared variable from other reads as well as writes. A reader/writer lock can dramatically improve performance: one writer, unlimited readers. On an app I've worked on, I saw a 35% performance improvement after switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
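Since ReaderWriterLockSlim is .NET-specific, here is a language-neutral sketch of the same construct using POSIX pthread_rwlock_t (names are illustrative): any number of readers may hold the lock at once, while a writer gets exclusive access.

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value;                 /* illustrative shared state */

int read_value(void)              /* many readers can run concurrently */
{
    pthread_rwlock_rdlock(&rw);
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v)           /* writers exclude everyone */
{
    pthread_rwlock_wrlock(&rw);
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}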
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications, where deadlocks, bottlenecks, etc. reduce the total performance.
There are commercial frameworks that simplify server software development when it comes to load balancing and distributed queue processing, but developing your own load-sharing infrastructure is not that complicated compared with what you will encounter in a multithreaded application in general.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads that is specific to your language, framework, and IDE. Your own logger might do fine too (resume time, sleep time + duration). From there you can check for badly performing threads that don't execute much or wait too long for something to happen; you might want to make the event they are waiting for occur as early as possible.
As you want to use both cores, you should check the usage of the cores with a tool that can graph the processor usage of both cores for your application only, or just make sure your computer is as idle as possible.
Besides that, you should profile your application just to make sure that the work performed within the threads is efficient, but watch out for premature optimization: there is no sense in optimizing your multiprocessing if the threads themselves perform badly.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't always equal performance.
Things are a lot better on certain operating systems than on others, but if you can have something sleep or relinquish its time until it's signaled (or avoid starting a new process for virtually everything), you save yourself from bogging the application down in context switching.
