I have a research-related question.
I have finished implementing a structured skeleton framework based on MPI (specifically Open MPI 6.3). The framework is intended to be used on a single machine.
Now I am comparing it with other, earlier skeleton implementations (such as Scandium, FastFlow, ...).
One thing I have noticed is that the performance of my implementation is not as good as that of the other implementations.
I think this is because my implementation is based on MPI (and thus on two-sided communication that requires matching send and receive operations),
while the other implementations I am comparing with are based on shared memory. (But I still have no good explanation for this, and it is part of my question.)
There is a big difference in completion time between the two categories.
Today I was also introduced to the configuration of Open MPI for shared memory here => openmpi-sm
and here comes my question.
1st: What does it mean to configure MPI for shared memory? MPI processes each live in their own virtual memory; what does a flag like the one in the following command actually do?
(I thought that in MPI every communication happens by explicitly passing a message, and that no memory is shared between processes.)
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
2nd: Why is the performance of MPI so much worse compared with the other skeleton implementations developed for shared memory? After all, I am also running it on a single multi-core machine.
(I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)
Any suggestion or further discussion is very welcome.
Please let me know if I need to clarify my question further.
Thank you for your time!
Open MPI is very modular. It has its own component model, called the Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from: it is used to provide runtime values to MCA parameters exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared-memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can connect to the other node. It gives preference to fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback, provided the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared-memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared-memory part of the Open MPI FAQ, which gives hints on which parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems.

Note that the default shared-memory communication implementation copies the data twice: once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node: the copy is done by the kernel device and is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
Another option is to forget about MPI entirely and use shared memory directly. You can use the POSIX memory-management interface (see here) to create a shared-memory block and have all processes operate on it directly. If the data is stored in shared memory, this could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory on remote sockets on the same board is slower. Process pinning/binding is also important: pass --bind-to-socket to mpiexec to have it pin each MPI process to a separate CPU socket.
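A minimal sketch of that approach using shm_open and mmap; the region name "/my_region" and the size are illustrative, and every process that maps the same name sees the same physical pages:

/* Sketch: create (or attach to) a named POSIX shared-memory region.
 * Link with -lrt on older glibc versions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_NAME "/my_region"   /* illustrative name */
#define REGION_SIZE 4096

int main(void) {
    /* Create or open the region and set its size. */
    int fd = shm_open(REGION_NAME, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) == -1) { perror("ftruncate"); return 1; }

    /* Map it into this process's virtual address space. */
    char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any process mapping the same name operates on the same bytes. */
    strcpy(p, "hello from shared memory");
    printf("%s\n", p);

    munmap(p, REGION_SIZE);
    close(fd);
    /* shm_unlink(REGION_NAME); -- call once, when the region is no longer needed */
    return 0;
}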
Related
I have an application where I have 3 different processes that need to run concurrently, in 3 different languages, on Windows:
A "data gathering" process, which interfaces with a sensor array. The developers of the sensor array have been kind enough to provide us with their C# source code, which I can modify. This should be generating raw data and shoving it into shared memory
A "post-processing" process. This is C++ code that uses CUDA to get the processing done as fast as possible. This should be taking raw data, moving it to the GPU, then taking the results from the GPU and communicating it to--
A feedback controller written in Matlab, which takes the results of the post-processing and uses it to make decisions on how to control a mechanical system.
I've done coursework on parallel programming, but that coursework was all done on Linux, where I used mmap for coordination between multiple processes. This makes sense to me: you ask the OS to map a page of your virtual memory to shared physical memory addresses, and the OS gives you some shared memory.
Googling around, it seems like the preferred way to set up shared memory between processes on Windows (in fact, the only "easy" way to do it in Matlab) is to use memory-mapped files, but this seems completely bonkers to me. If I'm understanding correctly, a memory-mapped file grabs some disk space and maps it to the physical address space, which is then mapped into the virtual address space of any process that accesses the same memory-mapped file.
This seems about three times more complex than it needs to be just to get multiple processes to map pages in their virtual address space to the same physical memory. I don't feel like I should be doing anything remotely related to disk I/O for what I'm trying to accomplish, especially since performance is a big issue for me (ideally I should be able to process 1000 sets of data per second, though that's not a hard limit). Is this really the right way to coordinate my processes?
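For what it's worth, a Windows file mapping does not have to be backed by a file on disk: passing INVALID_HANDLE_VALUE to CreateFileMapping gives a region backed by the system paging file, which behaves much like POSIX shared memory and involves no explicit disk I/O. A minimal sketch, with the object name "Local\SensorData" purely illustrative:

/* Sketch: pagefile-backed shared memory on Windows (no on-disk file involved). */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const DWORD size = 1 << 20;  /* 1 MiB, illustrative */

    /* INVALID_HANDLE_VALUE => backed by the paging file, not a disk file. */
    HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, size, "Local\\SensorData");
    if (h == NULL) { fprintf(stderr, "CreateFileMapping failed\n"); return 1; }

    /* Map a view of the region into this process. */
    char *p = (char *)MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (p == NULL) { fprintf(stderr, "MapViewOfFile failed\n"); return 1; }

    sprintf(p, "raw sensor frame goes here");

    UnmapViewOfFile(p);
    CloseHandle(h);
    return 0;
}

A consumer process would call OpenFileMappingA with the same name and then MapViewOfFile on the returned handle to see the same bytes.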
If all my processors share the same memory, is using MPI of any use, compared with going full OpenMP?
If you never intend to scale your application beyond a single shared-memory node, then OpenMP parallelisation might be relatively easier to implement than MPI parallelisation. Relatively, because the apparent simplicity of OpenMP is very misleading: in order to utilise the full capability of modern shared-memory machines, one should maximise data locality and use lots of private data, effectively treating the machines as distributed-memory systems. Also, the most prevalent errors in shared-memory programming are data races, and those can at times be very hard to debug, even when armed with special thread-checker tools. Data races are virtually absent in MPI programming, since processes do not share data.
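A tiny sketch of the kind of data race meant here, together with the usual OpenMP fix (the reduction clause); the loop and the numbers are purely illustrative:

/* Sketch: a classic OpenMP data race and its fix. Compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* RACE: every thread updates 'sum' without synchronisation,
     * so updates can be lost and the result is nondeterministic. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += 1.0;

    printf("racy sum    = %f\n", sum);

    /* FIX: a reduction gives each thread a private partial sum
     * that is combined once at the end of the loop. */
    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0;

    printf("reduced sum = %f\n", sum);   /* always 1000000.0 */
    return 0;
}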
That said, even when MPI processes communicate via shared memory, that is still slower than directly accessing the shared memory within a threaded process. Also, some algorithms require global data, which takes more memory with MPI, since each process has to hold a copy of that data. This is curable in MPI-3.0 using shared-memory windows with one-sided operations, but that is somewhat cumbersome (though portable). There are also research efforts to reduce the intra-node communication overhead as much as possible, and some are very successful.
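A rough sketch of what the MPI-3.0 shared-memory window approach looks like: one copy of the data lives per node and every process on the node gets a direct pointer to it. The size and the synchronisation pattern here are illustrative:

/* Sketch: one shared block per node via MPI-3 shared-memory windows. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group the processes that can actually share memory (same node). */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int rank;
    MPI_Comm_rank(nodecomm, &rank);

    /* Rank 0 allocates the shared block; the others contribute size 0. */
    MPI_Aint size = (rank == 0) ? 1024 * sizeof(double) : 0;
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                            nodecomm, &base, &win);

    /* Every process queries rank 0's segment and gets a direct pointer. */
    MPI_Aint qsize;
    int disp_unit;
    double *shared;
    MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &shared);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    if (rank == 0)
        shared[0] = 42.0;        /* written once per node ...     */
    MPI_Win_sync(win);           /* make the write visible        */
    MPI_Barrier(nodecomm);
    MPI_Win_sync(win);           /* refresh the readers' view     */
    printf("rank %d sees %.1f\n", rank, shared[0]);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}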
I need to write an application that hashes words from a dictionary to make WPA pre-shared keys. This is my thesis for a "Networking Security" course. The application needs to be parallel for increased performance. I have some experience with MPI from my IT studies, but I would like to tie it together with CUDA. The idea is to use MPI to distribute the load evenly across the nodes of the cluster and then utilize CUDA to run the individual chunks in parallel inside the GPUs of the nodes.
Distributing the load with MPI is something I can easily do and have done in the past. Computing with CUDA is something I can learn. There is also a project (pyrit) that does more or less what I need to do (actually a lot more), and I can get ideas from there.
I would like some advice on how to make the connection between MPI and CUDA. If somebody has built anything like this, I would greatly appreciate their advice and suggestions. Also, if you happen to know of any resources on the topic, please point me to them.
Sorry for the lengthy intro but I thought it was necessary to give some background.
This question is largely open-ended, so it's hard to give a definitive answer. This one is just a summary of the comments made by High Performance Mark, me, and Jonathan Dursi. I do not claim authorship and have therefore made this answer a community wiki.
MPI and CUDA are orthogonal. The former is an IPC middleware and is used to communicate between processes (possibly residing on separate nodes) while the latter provides highly data-parallel shared-memory computing to each process that uses it. You can break the task into many small subtasks and use MPI to distribute them to worker processes running on the network. The master/worker approach is suitable for this kind of application, especially if words in the dictionary vary greatly in their length and variance in processing time is to be expected. Provided with all the necessary input values, worker processes can then use CUDA to perform the necessary computations in parallel and then return results back using MPI. MPI also provides the mechanisms necessary to launch and control multinode jobs.
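A bare-bones sketch of that master/worker layer, with the CUDA part reduced to a hypothetical hash_word() placeholder; the round-robin distribution and the tiny dictionary are illustrative only:

/* Sketch: MPI master/worker distribution of dictionary words.
 * Run with at least 2 ranks (one master, one or more workers). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64
#define TAG_WORK 1
#define TAG_STOP 2

/* Hypothetical placeholder for the real, CUDA-accelerated hashing kernel. */
static void hash_word(const char *word, char *out) {
    snprintf(out, MAX_WORD, "hash(%s)", word);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master */
        const char *dict[] = { "password", "letmein", "dragon" };
        int n = sizeof dict / sizeof dict[0];
        char buf[MAX_WORD] = { 0 };
        for (int i = 0; i < n; i++) {       /* round-robin for brevity */
            strncpy(buf, dict[i], MAX_WORD - 1);
            MPI_Send(buf, MAX_WORD, MPI_CHAR, 1 + i % (size - 1),
                     TAG_WORK, MPI_COMM_WORLD);
        }
        buf[0] = '\0';
        for (int w = 1; w < size; w++)      /* tell every worker to stop */
            MPI_Send(buf, MAX_WORD, MPI_CHAR, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                /* worker */
        char word[MAX_WORD], result[MAX_WORD];
        MPI_Status st;
        while (1) {
            MPI_Recv(word, MAX_WORD, MPI_CHAR, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            hash_word(word, result);        /* a real worker would launch CUDA here */
            printf("rank %d: %s -> %s\n", rank, word, result);
            /* ... and would MPI_Send the result back to rank 0 */
        }
    }
    MPI_Finalize();
    return 0;
}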
Although MPI and CUDA can be used separately, modern MPI implementations provide mechanisms that blur the boundary between the two. This can be direct support for device pointers in MPI communication operations, transparently calling CUDA functions to copy memory when necessary, or even support for RDMA to/from device memory without an intermediate copy to main memory. The former simplifies your code, while the latter can save varying amounts of time, depending on how your algorithm is structured. The latter also requires both fairly new CUDA hardware and drivers and newer networking equipment (e.g. a newer InfiniBand HCA).
MPI libraries that support direct GPU memory operations include MVAPICH2 and the trunk SVN version of Open MPI.
I do not quite understand the benefit of "multiple independent virtual addresses which point to the same physical address", even though I have read many books and posts.
E.g., in a similar question, Difference between physical addressing and virtual addressing concept,
the post claims that programs will not crash into each other, and that
"in general, a particular physical page only maps to one application's virtual space".
Well, in http://tldp.org/LDP/tlk/mm/memory.html, in the section "Shared Virtual Memory", it says:
"For example there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each processes virtual address space, it is better to have only one copy in physical memory and all of the processes running bash share it."
If one physical address (e.g., that of the shell program) is mapped to two independent virtual addresses, how can this not crash? Wouldn't it be the same as using physical addressing directly?
What does virtual addressing provide that is not possible or convenient with physical addressing? If no virtual memory existed, i.e., if the two processes pointed directly to the same physical memory, I think it could still work with some coordinating mechanism. So why bother with virtual addressing, the MMU, and virtual memory at all?
There are two main uses of this feature.
First, you can share memory between processes, which can then communicate via the shared pages. In fact, shared memory is one of the simplest forms of IPC.
But shared read-only pages can also be used to avoid useless duplication: most of the time, the code of a program does not change after it has been loaded into memory, so its memory pages can be shared among all the processes that are running that program. Obviously only the code is shared; the memory pages containing the stack, the heap, and in general the data (or, if you prefer, the state) of the program are not shared.
This trick is improved with "copy on write". The code of executables usually doesn't change while running, but there are programs that are actually self-modifying (they were quite common in the past, when most development was still done in assembly); to support this, the operating system does read-only sharing as explained before, but, if it detects a write to one of the shared pages, it disables sharing for that page, creating an independent copy of it and letting the program write there.
This trick is particularly useful in situations in which there's a good chance that the data won't change, but it may happen.
Another case in which this technique is used is when a process forks: instead of copying every memory page (which would be completely useless if the child process immediately does an exec), the new process shares all its memory pages with the parent in copy-on-write mode, allowing quick process creation while still "faking" the "classic" fork behaviour.
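A tiny demonstration of that copy-on-write behaviour after fork(): both processes start out referencing the same physical pages, but the child's write triggers a private copy, so the parent's value stays unchanged.

/* Sketch: after fork(), parent and child share pages copy-on-write;
 * the child's write gives it a private copy of the page. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 100;

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {            /* child */
        value = 200;           /* write -> kernel copies the page for the child */
        printf("child  sees value = %d\n", value);   /* 200 */
        _exit(0);
    }

    wait(NULL);                /* let the child finish first */
    printf("parent sees value = %d\n", value);       /* still 100 */
    return 0;
}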
If one physical address (e.g., shell program) mapped to two independent virtual addresses
Multiple processes can be built to share a piece of memory; e.g. with one acting as a server that writes to the memory, the other as a client reading from it, or with both reading and writing. This is a very fast way of doing inter-process communication (IPC). (Other solutions, such as pipes and sockets, require copying data to the kernel and then to the other process, which shared memory skips.) But, as with any IPC solution, the programs must coordinate their reads and writes to the shared memory region by some messaging protocol.
Also, the "several processes in the system running the bash command shell" from the example will be sharing the read-only part of their address spaces, which includes the code. They can execute the same in-memory code concurrently, and won't kill each other since they can't modify it.
In the quote
in general, a particular physical page only maps to one application's virtual space
the "in general" part should really be "typically": memory pages are not shared unless you set them up to be, or unless they are read-only.
How is memory allocated on slave nodes for the execution of MPI programs? How do slave nodes know the amount of memory to reserve? What happens when a slave node can't find the data it wants to access?
This is not a homework problem, but a question that came to my mind and that I couldn't find an answer to by googling.
With a non-specific question, the best answer you can expect will also be non-specific.
When programming with MPI you typically write a single program which is launched (via mpirun/mpiexec, or some batch system, e.g. Torque) on a set of nodes.
The master-slave model is but one approach.
Memory allocation is typically under program control: just as you would allocate memory as needed in any other application, so too in your MPI program.
As for finding the data, it is often provided to the slaves (directly or indirectly) by the master process, if the master-slave model is used. If each MPI instance indeed has to "search" for the data it is to process, then, as with any program that is unable to find what it requires, it should send a suitable error message/status back to the caller (or the master process).
.PMCD.
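A minimal sketch of the pattern described in the answer above: the master (rank 0) holds the full data set, every rank allocates only the memory for its own chunk, and MPI_Scatter hands the chunks out. The chunk size is illustrative.

/* Sketch: each rank allocates only the memory it needs for its chunk;
 * the root distributes the data with MPI_Scatter. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 4;                     /* elements per rank, illustrative */
    double *full = NULL;

    if (rank == 0) {                         /* only the root holds the full array */
        full = malloc((size_t)size * chunk * sizeof *full);
        for (int i = 0; i < size * chunk; i++)
            full[i] = i;
    }

    /* Every rank reserves just enough memory for its own piece. */
    double *mine = malloc(chunk * sizeof *mine);
    MPI_Scatter(full, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    double local_sum = 0.0;
    for (int i = 0; i < chunk; i++)
        local_sum += mine[i];
    printf("rank %d: local sum = %.1f\n", rank, local_sum);

    free(mine);
    free(full);
    MPI_Finalize();
    return 0;
}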