How does MPI_Comm_split_type know what processes on your computer can create shared memory? - parallel-processing

I'm looking into MPI one-sided communication, specifically shared memory. Before you allocate a section of memory to be shared between processes, you need to split them into groups that are able to share memory. This is done using the function MPI_Comm_split_type.
This is supposed to do everything for you: split your communicator into groups of processes that can share memory, and return the new communicator to you.
My question is: how does it know whether two processes can share memory? Do I need to have something set up properly on my end for it to accurately determine the memory layout of my system?
So far, when I've used it, it seems to create shared memory the way I would expect. I'm just worried about whether it will continue to correctly identify the processes when I port the code to different systems.

Related

Trying to share memory in Windows across multiple processes, multiple languages

I have an application where I have 3 different processes that need to run concurrently, in 3 different languages, on Windows:
A "data gathering" process, which interfaces with a sensor array. The developers of the sensor array have been kind enough to provide us with their C# source code, which I can modify. This should be generating raw data and shoving it into shared memory
A "post-processing" process. This is C++ code that uses CUDA to get the processing done as fast as possible. This should be taking raw data, moving it to the GPU, then taking the results from the GPU and communicating it to--
A feedback controller written in Matlab, which takes the results of the post-processing and uses it to make decisions on how to control a mechanical system.
I've done coursework on parallel programming, but that coursework was all on Linux, where I used mmap (from sys/mman.h) for coordination between multiple processes. This makes sense to me: you ask the OS for a page in virtual memory to be mapped to shared physical memory addresses, and the OS gives you some shared memory.
Googling around, it seems like the preferred way to set up shared memory between processes in Windows (in fact, the only "easy" way to do it in Matlab) is to use memory-mapped files, but this seems completely bonkers to me. If I'm understanding correctly, a memory-mapped file grabs some disk space and maps it to the physical address space, which is then mapped into the virtual address space for any process that accesses the same memory-mapped file.
This seems about three times more complex than it needs to be just to get multiple processes to map pages in their virtual address space to the same physical memory. I don't feel like I should be doing anything remotely related to disk I/O for what I'm trying to accomplish, especially since performance is a big issue for me (ideally I should be able to process 1000 sets of data per second, though that's not a hard limit). Is this really the right way to coordinate my processes?

Using shared memory in ArrayFire

Does anyone know how to declare that an array of data in ArrayFire should be stored in shared memory instead of global memory? Is this possible? I have a small set of data that needs to be randomly accessible by all threads. It's a constant look-up table that should be available for the life of the application. Maybe I am just missing the obvious or something, but reading the ArrayFire docs and googling have not turned up any info on how I tell ArrayFire that my data needs to go into shared memory.
In CUDA, shared memory (local memory in OpenCL) is a very fast type of memory that is located on the GPU. It has the same lifetime as a thread block and can only be accessed by threads in the same thread block. It therefore cannot be used to store persistent data that needs to be used by multiple kernels, even in raw CUDA. You might want to look into constant or texture memory to implement a look-up table (LUT). These memory types are usually better suited to the type of access you typically encounter with a LUT.
ArrayFire has a high-level API which makes GPU programming easy, with one of the fastest implementations of many commonly used functions. With ArrayFire you will not be able to specify which type of memory is created, but you are free to use the data in your own kernel. If you are using one of our functions, then it is very likely we will make use of shared/texture/constant memory where it makes sense.
Umar
Disclosure: I am one of the developers of ArrayFire

what does it mean configuring MPI for shared memory?

I have a bit of a research-related question.
I have just finished implementing a structured skeleton framework based on MPI (specifically using openmpi 6.3). The framework is supposed to be used on a single machine.
Now I am comparing it with other, earlier skeleton implementations (such as scandium, fast-flow, ..).
One thing I have noticed is that the performance of my implementation is not as good as that of the other implementations.
I think this is because my implementation is based on MPI (and thus on two-sided communication that requires matching send and receive operations),
while the other implementations I am comparing against are based on shared memory (... but I still have no good explanation for that, and it is part of my question).
There are some big differences in completion time between the two categories.
Today I was also introduced to the configuration of Open MPI for shared memory here => openmpi-sm
and that is where my question comes from.
1st: what does it mean to configure MPI for shared memory? I mean, MPI processes live in their own virtual memory, so what does a flag like the one in the following command really do?
(I thought that in MPI every communication is done by explicitly passing a message, and no memory is shared between processes.)
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
2nd: why is the performance of MPI so much worse compared to other skeleton implementations developed for shared memory? After all, I am also running it on one single multi-core machine.
(I suppose it is because the other implementations use thread-parallel programming, but I have no convincing explanation for that.)
Any suggestion or further discussion is very welcome.
Please let me know if I have to clarify my question further.
Thank you for your time!
Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can connect to the other node. It gives preference to fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback if the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If the data is stored in the shared memory, this could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory from remote sockets on the same board is slower. Process pinning/binding is also important: pass --bind-to-socket to mpiexec to have it bind each MPI process to a single CPU socket, so that the process and the memory it touches stay on the same socket.
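To illustrate the "operate on shared memory directly" idea, here is a minimal sketch in Python rather than C. It is not the shm_open route mentioned above: it assumes a Unix system and uses an anonymous shared mapping inherited across fork, which is the simplest variant to show. The names SLOT and sum_of_squares_shared are made up for this example.

```python
import mmap
import os
import struct

# Hypothetical layout: one signed 64-bit slot per worker.
SLOT = struct.Struct("q")


def sum_of_squares_shared(num_workers):
    """Fork workers that each write one result straight into shared memory."""
    # Anonymous mapping (fileno=-1). On Unix the default flag is MAP_SHARED,
    # so forked children see the same physical pages as the parent.
    shared = mmap.mmap(-1, SLOT.size * num_workers)
    pids = []
    for rank in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child: write into its own slot, then exit without running
            # any of the parent's cleanup code.
            try:
                SLOT.pack_into(shared, rank * SLOT.size, rank * rank)
            finally:
                os._exit(0)
        pids.append(pid)
    # Parent: wait for every child before reading the slots.
    for pid in pids:
        os.waitpid(pid, 0)
    values = [SLOT.unpack_from(shared, r * SLOT.size)[0]
              for r in range(num_workers)]
    shared.close()
    return values


results = sum_of_squares_shared(4)
print(results)
```

No message is ever sent here: the children store their results in place and the parent reads them after waitpid. The only synchronization is process exit, which is enough for this toy case; real code sharing a block between long-lived processes would need semaphores, locks, or atomics.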

Why code segment is common for different instances of same program

I wanted to know why the code segment is common to different instances of the same program.
For example: consider a program P1.exe that is running. If another copy of P1.exe is started, the code segment will be common to both running instances. Why is that so?
If the code segment in question is loaded from a DLL, it might be the operating system being clever and re-using the already loaded library. This is one of the core points of using dynamically loaded library code, it allows the code to be shared across multiple processes.
Not sure if Windows is clever enough to do this with the code sections of regular EXE files, but it would make sense if possible.
It could also be virtual memory fooling you; two processes can look like they have the same thing on the same address, but that address is virtual, so they really are just showing mappings of physical memory.
Code is typically read-only, so it would be wasteful to make multiple copies of it.
Also, Windows (at least, I can't speak for other OS's at this level) uses the paging infrastructure to page code in and out direct from the executable file, as if it were a paging file. Since you are dealing with the same executable, it is paging from the same location to the same location.
Self-modifying code is effectively no longer supported by modern operating systems. Generating new code is possible (by setting the correct flags when allocating memory) but this is separate from the original code segment.
The code segment is (supposed to be) static (does not change) so there is no reason not to use it for several instances.
Just to start at a basic level, segmentation is just a way to implement memory isolation and partitioning. Paging is another way to achieve this. For the most part, anything you can achieve via segmentation, you can achieve via paging. As such, most modern operating systems on the x86 forgo segmentation entirely, relying completely on paging facilities instead.
Because of this, all processes will usually be running under the trivial segment of (Base = 0, Limit = 4GB, Privilege level = 3), which means the code/data segment registers play no real part in determining the physical address, and are just used to set the privilege level of the process. All processes will usually be run at the same privilege, so they should all have the same value in the segment register.
Edit
Maybe I misinterpreted the question. I thought the question author was asking why both processes have the same value in the code segment register.

What is the purpose of allocating pages in the pagefile with CreateFileMapping?

The function CreateFileMapping can be used to allocate space in the pagefile (if the first argument is INVALID_HANDLE_VALUE). The allocated space can later be memory mapped into the process virtual address space.
Why would I want to do this instead of using just VirtualAlloc?
It seems that both functions do almost the same thing. Memory allocated by VirtualAlloc may at some point be pushed out to the pagefile. Why should I need an API that specifically requests that my pages be allocated there in the first instance? Why should I care where my private pages live?
Is it just a hint to the OS about my expected memory usage patterns? (I.e., the former is a hint to swap out those pages more aggressively.)
Or is it simply a convenience method when working with very large datasets in 32-bit processes? (I.e., I can use CreateFileMapping to make >4 GB allocations, then memory map smaller chunks of the space as needed. Using the pagefile saves me the work of manually managing my own set of files to "swap" to.)
PS. This question is sparked by an article I read recently: http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx
From the CreateFileMapping function documentation:
A single file mapping object can be shared by multiple processes.
Can virtual memory be shared across multiple processes?
One reason is to share memory among different processes. Different processes can communicate over the page file just by knowing the name of the mapping object. This is preferable to creating a real file and communicating through it. Of course, there may be other use cases. You can refer to Using a File Mapping for IPC at MSDN for more information.
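As a rough illustration of sharing by name, here is a small sketch using Python's standard multiprocessing.shared_memory module instead of calling the Win32 API directly. As far as I know, on Windows this module creates a named mapping backed by the pagefile (CreateFileMapping with INVALID_HANDLE_VALUE), and on POSIX systems it goes through shm_open, so no file of your own is involved in either case. The block size and the payload are arbitrary, and both handles are opened in the same process only to keep the example short.

```python
from multiprocessing import shared_memory

# "Producer" side: create a named block. The OS-level name is generated
# automatically and exposed as writer.name.
writer = shared_memory.SharedMemory(create=True, size=16)
writer.buf[:5] = b"hello"

# "Consumer" side: attach using nothing but the name. In practice this
# would happen in a different process that was told the name.
reader = shared_memory.SharedMemory(name=writer.name)
received = bytes(reader.buf[:5])
print(received)

# Each handle is closed separately; unlink() releases the block itself
# (on Windows the mapping disappears once the last handle is closed).
reader.close()
writer.close()
writer.unlink()
```

The only thing the two sides have to agree on is the name, which is the point made above about the mapping object.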

Resources